
Annotation statistics

Beta documentation: Label Studio Enterprise v2.0.0 is currently in Beta. As a result, this documentation might not reflect the current functionality of the product.

Label Studio Enterprise Edition includes various annotation and labeling statistics. The open source Community Edition of Label Studio does not perform these statistical calculations. If you're using Label Studio Community Edition, see Label Studio Features to learn more.

Task agreement

Task agreement shows the consensus between multiple annotators when labeling the same task. There are several types of task agreement in Label Studio Enterprise.

You can also see how the annotations from a specific annotator compare to the prediction scores for a task, or how they compare to the ground truth labels for a task.

Matching score

A matching score assesses the similarity of annotations for a specific task. The matching score is used differently depending on which agreement metrics are being calculated.

Matching scores are used to determine whether two given annotations for a task, represented by x and y in the following examples, match.

The type of data labeling being performed affects how the matching score is computed. The following examples describe how the matching scores for various labeling configuration tags are computed.

Choices

For data labeling where annotators select a choice, the matching score for two given task annotations x and y is based on comparing the selected choices.
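As a minimal sketch, assuming a score of 1 when both annotations contain the same selected choices and 0 otherwise (not the literal implementation used by Label Studio Enterprise):

```python
def choices_matching_score(x: set, y: set) -> float:
    """Sketch: exact-match score for two Choices results.

    x and y are the sets of choice values selected in each annotation.
    Assumes 1.0 when the selections are identical, 0.0 otherwise.
    """
    return 1.0 if x == y else 0.0

print(choices_matching_score({"positive"}, {"positive"}))  # 1.0
print(choices_matching_score({"positive"}, {"negative"}))  # 0.0
```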

TextArea

For data labeling where annotators transcribe text in a text area, the resulting annotations contain a list of texts. To compare two given task annotations x and y, the entries in the two lists are aligned and each aligned pair is scored.

The matching score for each aligned pair can be calculated in multiple ways.
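As one illustration only, the sketch below aligns the two lists positionally and scores each aligned pair with a string-similarity ratio from Python's standard library; both the alignment strategy and the per-pair similarity function are assumptions, not the exact algorithm used by Label Studio Enterprise.

```python
from difflib import SequenceMatcher

def textarea_matching_score(x: list[str], y: list[str]) -> float:
    """Sketch: align the transcribed texts positionally and average
    a per-pair similarity score (here, difflib's ratio)."""
    if not x or not y:
        return 0.0
    pairs = zip(x, y)  # assumed positional alignment
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    # Penalize unmatched entries by normalizing over the longer list.
    return sum(scores) / max(len(x), len(y))

print(textarea_matching_score(["hello world"], ["hello wrld"]))
```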

Labels

For data labeling where annotators apply labels to text spans, the matching score for two given task annotations x and y is calculated by comparing the intersection of the result spans, spans(x) ∩ spans(y), normalized by the length of each span.
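One plausible reading of this normalization, for character-offset spans, is sketched below: the overlap length is divided by each span's length and the two values are averaged. The normalization choice is an assumption for illustration, not the exact formula used by Label Studio Enterprise.

```python
def span_matching_score(x: tuple[int, int], y: tuple[int, int]) -> float:
    """Sketch: overlap of two [start, end) character spans,
    normalized by the length of each span and then averaged."""
    overlap = max(0, min(x[1], y[1]) - max(x[0], y[0]))
    if overlap == 0:
        return 0.0
    return (overlap / (x[1] - x[0]) + overlap / (y[1] - y[0])) / 2

# Spans (0, 10) and (5, 15) overlap by 5 characters
print(span_matching_score((0, 10), (5, 15)))  # 0.5
```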

Rating

For data labeling where annotators select a rating, the matching score for two given task annotations x and y is based on comparing the selected ratings.
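As a minimal sketch, assuming two ratings match only when they are equal (the handling of near-miss ratings, if any, is not shown):

```python
def rating_matching_score(x: int, y: int) -> float:
    """Sketch: exact-match score for two Rating results."""
    return 1.0 if x == y else 0.0

print(rating_matching_score(4, 4))  # 1.0
print(rating_matching_score(4, 5))  # 0.0
```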

Ranker

The matching score is calculated using the mean average precision (mAP) for the annotation results.
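The sketch below is a rough illustration of an average precision calculation for one ranked list scored against a set of items treated as relevant; how the relevant set is derived from the other annotation, and how the per-result values are averaged into mAP, are assumptions here rather than details taken from the product.

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Sketch: average precision of a ranked list against a set of
    relevant items (precision accumulated at each relevant hit)."""
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

# One annotation's top items treated as the relevant set (assumption),
# the other annotation's ordering evaluated against it.
print(average_precision(["a", "c", "b"], {"a", "b"}))  # ≈ 0.83
```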

RectangleLabels

The method used to calculate the matching score depends on which option you select on the project settings page.
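Bounding-box comparisons of this kind are commonly based on intersection over union (IoU). The sketch below computes a plain IoU for two axis-aligned boxes given as (x, y, width, height); whether labels must also match, and any threshold applied, depend on the project settings and are not shown here.

```python
def bbox_iou(a: tuple[float, float, float, float],
             b: tuple[float, float, float, float]) -> float:
    """Sketch: IoU of two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

print(bbox_iou((10, 10, 20, 20), (15, 15, 20, 20)))  # partial overlap, ≈ 0.39
```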

PolygonLabels

The method used to calculate the matching score depends on which option you select on the project settings page.
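For polygons, the same intersection-over-union idea applies to arbitrary shapes. The sketch below uses the third-party shapely library purely for illustration; it is an assumption, not necessarily the geometry backend Label Studio Enterprise uses.

```python
from shapely.geometry import Polygon  # third-party; used here only for illustration

def polygon_iou(a: list[tuple[float, float]],
                b: list[tuple[float, float]]) -> float:
    """Sketch: IoU of two polygons given as lists of (x, y) vertices."""
    pa, pb = Polygon(a), Polygon(b)
    union = pa.union(pb).area
    return pa.intersection(pb).area / union if union else 0.0

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
shifted = [(5, 0), (15, 0), (15, 10), (5, 10)]
print(polygon_iou(square, shifted))  # 50 / 150 ≈ 0.33
```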

Agreement method

The agreement method defines how matching scores across all annotations for a task are combined to form a single inter-annotator agreement score.

There are several possible methods that you can specify on the project settings page:

Complete linkage

Complete linkage task agreement groups annotations so that all the matching scores within a given group are higher than the threshold. The agreement score is the maximum group size divided by the total count of annotations.

Review the diagram for a full explanation:

Diagram showing annotations are collected for each task, matching scores are computed for each pair, and grouping and agreement score calculation happens as detailed in the surrounding text.
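A minimal sketch of this grouping rule, assuming the pairwise matching scores have already been computed and using a simple greedy pass rather than the product's actual clustering code:

```python
def complete_linkage_agreement(scores: dict[tuple[int, int], float],
                               n_annotations: int,
                               threshold: float) -> float:
    """Sketch: group annotations so every pair inside a group scores
    above the threshold, then return max group size / total annotations."""
    def score(i: int, j: int) -> float:
        return scores.get((i, j), scores.get((j, i), 0.0))

    groups: list[list[int]] = []
    for a in range(n_annotations):
        for group in groups:
            # Join only if the annotation matches EVERY current member.
            if all(score(a, member) > threshold for member in group):
                group.append(a)
                break
        else:
            groups.append([a])
    return max(len(g) for g in groups) / n_annotations

# Three annotations: 0 and 1 agree strongly, 2 disagrees with both.
print(complete_linkage_agreement({(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.0}, 3, 0.4))  # ≈ 0.67
```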

Single linkage

Single linkage task agreement groups annotations so that at least one of the matching scores within a given group is higher than the threshold. The agreement score is the maximum group size divided by the total count of annotations.

Review the diagram for a full explanation:

Diagram showing annotations are collected for each task, matching scores are computed for each pair, and grouping and agreement score calculation happens as detailed in the surrounding text.
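The sketch below differs from the complete linkage sketch above only in the membership test: an annotation joins a group when it matches at least one existing member above the threshold. Again, this is an illustrative greedy pass, not the product's implementation.

```python
def single_linkage_agreement(scores: dict[tuple[int, int], float],
                             n_annotations: int,
                             threshold: float) -> float:
    """Sketch: an annotation joins a group if it matches at least one
    member above the threshold; agreement = max group size / total."""
    def score(i: int, j: int) -> float:
        return scores.get((i, j), scores.get((j, i), 0.0))

    groups: list[list[int]] = []
    for a in range(n_annotations):
        for group in groups:
            # Join if the annotation matches ANY current member.
            if any(score(a, member) > threshold for member in group):
                group.append(a)
                break
        else:
            groups.append([a])
    return max(len(g) for g in groups) / n_annotations

# Annotation 2 matches annotation 1, so all three end up in one group.
print(single_linkage_agreement({(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.5}, 3, 0.4))  # 1.0
```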

No grouping

No grouping task agreement uses the mean of the matching scores across all annotation pairs as the final task agreement score.

Review the diagram for a full explanation:

Diagram showing annotations are collected for each task, matching scores are computed for each pair, the resulting scores are averaged for a task.
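A minimal sketch of this averaging, assuming the pairwise matching scores are already available:

```python
def no_grouping_agreement(pairwise_scores: list[float]) -> float:
    """Sketch: task agreement as the mean of all pairwise matching scores."""
    return sum(pairwise_scores) / len(pairwise_scores) if pairwise_scores else 0.0

# Scores for the pairs (0, 1), (0, 2) and (1, 2)
print(no_grouping_agreement([0.9, 0.1, 0.0]))  # ≈ 0.33
```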

Example

Consider three annotations for the same task: one labels the text span “Excellent tool” as “positive”, a second labels the span “tool” as “positive”, and a third labels the span “tool” as “negative”.

Diagram showing the example labeling scenario described in the surrounding text.

The matching score for the first two annotations is 50%, based on the intersection of the text spans. The matching score comparing the second annotation with the third annotation is 0%, because the same text span was labeled differently.

The task agreement conditions use a threshold of 40% to group annotations based on the matching score, so the first and second annotations are matched with each other, and the third annotation is considered mismatched. In this case, task agreement exists for 2 of the 3 annotations, so the overall task agreement score is 67%.
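The same arithmetic as a short runnable sketch using the complete linkage rule; the 0% score between the first and third annotations is an assumption here (their labels differ), since only the other two pairwise scores are stated above.

```python
threshold = 0.4
# Pairwise matching scores for annotations 0, 1 and 2 from the example.
scores = {(0, 1): 0.5, (1, 2): 0.0, (0, 2): 0.0}  # (0, 2) assumed 0%: labels differ

# Pairs that clear the 40% threshold form a group; annotation 2 stands alone.
matched_pairs = [pair for pair, s in scores.items() if s > threshold]
print(matched_pairs)  # [(0, 1)] -> annotations 0 and 1 group together
print(2 / 3)          # largest group (2) / total annotations (3) ≈ 0.67, i.e. 67%
```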