What Metrics Does Encord Use to Measure Annotation Quality?

No single metric captures annotation quality. A project can have high inter-annotator agreement and still produce poor training data if all annotators share the same misunderstanding. A project can hit throughput targets and fail on accuracy. Quality measurement in annotation requires multiple signals used together.

Here is what Encord surfaces and how to use each metric.

TL;DR

  • Encord uses IoU as the primary IAA metric for geometric annotation, broken down by annotator pair and label class.
  • Ground truth accuracy compares annotator output against a verified reference — the right tool for catching systematic bias that IAA alone misses.
  • Rework rate is an underrated quality proxy: high rates on specific task types usually signal guideline problems, not just annotator errors.
  • Encord's quality metrics were built for computer vision and do not natively surface Krippendorff's alpha or RLHF-appropriate measures.

Inter-annotator agreement

IAA measures how consistently different annotators label the same task. Encord calculates this when tasks are routed to multiple annotators through consensus annotation workflows.

For geometric annotation tasks like bounding boxes and segmentation masks, Encord uses IoU (Intersection over Union) as the primary agreement measure. A score of 1.0 means annotators drew identical shapes; lower scores indicate divergence.
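To make the metric concrete, here is a minimal IoU calculation for axis-aligned bounding boxes — a generic sketch of the formula, not Encord's internal implementation:

```python
def bbox_iou(a, b):
    """Intersection over Union for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1])
             - inter)
    return inter / union if union else 0.0

# Two annotators drawing the same object: identical boxes score 1.0,
# partial overlap scores proportionally lower.
print(bbox_iou((10, 10, 50, 50), (10, 10, 50, 50)))  # 1.0
print(bbox_iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14
```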

IAA is surfaced at the project level and broken down by annotator pair and by label class. This lets project managers identify whether disagreement is random — spread across all annotators and classes — or systematic, concentrated in specific annotators or specific label types. Systematic disagreement in a specific class typically signals a guideline problem. Systematic disagreement from a specific annotator typically signals a training problem.
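If you export consensus results and want to reproduce this breakdown yourself, a simple aggregation over per-object IoU scores is enough. The field names below are hypothetical stand-ins for whatever your export actually contains:

```python
import pandas as pd

# Hypothetical export: one row per object matched between two annotators.
records = pd.DataFrame([
    {"annotator_a": "ann_1", "annotator_b": "ann_2", "label_class": "car",        "iou": 0.91},
    {"annotator_a": "ann_1", "annotator_b": "ann_2", "label_class": "pedestrian", "iou": 0.58},
    {"annotator_a": "ann_1", "annotator_b": "ann_3", "label_class": "pedestrian", "iou": 0.55},
    {"annotator_a": "ann_2", "annotator_b": "ann_3", "label_class": "car",        "iou": 0.88},
])

# Low agreement concentrated in one class points at the guideline;
# low agreement concentrated in one annotator pair points at training.
print(records.groupby("label_class")["iou"].mean())
print(records.groupby(["annotator_a", "annotator_b"])["iou"].mean())
```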

Ground truth accuracy

Ground truth accuracy compares annotator outputs against a verified reference set: pre-labeled tasks with known-correct answers seeded into annotation queues. Unlike IAA, which only measures consistency between annotators, ground truth accuracy measures correctness against an external standard.

Encord expresses this as a coefficient from 0 to 1; a score above 0.9 is generally considered acceptable. Low ground truth accuracy combined with high IAA is the signature of systematic bias: annotators consistently agree on the wrong answer.
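A minimal sketch of the idea, assuming each seeded reference task yields an IoU between the annotator's shape and the verified label (the 0.5 acceptance threshold is illustrative, not an Encord default):

```python
def ground_truth_accuracy(annotator_ious, threshold=0.5):
    """Fraction of seeded reference tasks where the annotator's shape
    overlaps the verified label above an acceptance threshold."""
    hits = sum(iou >= threshold for iou in annotator_ious)
    return hits / len(annotator_ious)

# IoU of each submitted annotation against its seeded ground truth task.
seeded_task_ious = [0.94, 0.88, 0.41, 0.90, 0.97]
print(round(ground_truth_accuracy(seeded_task_ious), 2))  # 0.8 — below the 0.9 bar
```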

The quality of ground truth data matters as much as the quantity. Ground truth annotations that do not reflect real data distribution introduce sampling bias. Teams that borrow ground truth from external labeled datasets rather than creating domain-specific reference sets frequently encounter this problem.

Throughput and performance analytics

Encord's dashboards surface throughput metrics: tasks completed per hour, tasks per annotator, completion rates, and rework rates — how often tasks are sent back for correction after review.

Rework rate is a useful quality proxy. High rework rates on specific task types or from specific annotators indicate quality problems that IAA scores might not surface, particularly for tasks where reviewer judgment is the quality gate.
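Computing rework rate from a review log is straightforward. The sketch below assumes a flat export of review outcomes; the structure and names are hypothetical:

```python
from collections import defaultdict

# Hypothetical review log: (task_type, annotator, was_sent_back)
review_log = [
    ("segmentation", "ann_1", True),
    ("segmentation", "ann_2", True),
    ("segmentation", "ann_1", False),
    ("bounding_box", "ann_1", False),
    ("bounding_box", "ann_3", False),
    ("bounding_box", "ann_2", True),
]

def rework_rate(log, key_index):
    """Share of reviewed tasks sent back, grouped by the chosen column."""
    sent_back, totals = defaultdict(int), defaultdict(int)
    for row in log:
        totals[row[key_index]] += 1
        sent_back[row[key_index]] += row[2]
    return {key: sent_back[key] / totals[key] for key in totals}

# A spike on one task type suggests a guideline problem;
# a spike on one annotator suggests a training problem.
print(rework_rate(review_log, key_index=0))  # by task type
print(rework_rate(review_log, key_index=1))  # by annotator
```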

Annotator performance dashboards allow admins to compare individual throughput and quality metrics. This data supports workforce decisions: identifying strong annotators for complex tasks, flagging those who need additional training, and calibrating task routing accordingly.

What the metrics do not cover

Encord's quality metrics were built for computer vision. For text annotation, standard NLP agreement measures like Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha are not natively surfaced in the platform's dashboards. Teams doing text annotation typically compute these metrics externally.
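One common way to compute these externally is NLTK's agreement module, shown here with made-up labels (the `krippendorff` package on PyPI is another option):

```python
from nltk.metrics.agreement import AnnotationTask

# Flat (coder, item, label) triples exported from a text classification project.
triples = [
    ("ann_1", "doc_1", "positive"),
    ("ann_2", "doc_1", "positive"),
    ("ann_3", "doc_1", "negative"),
    ("ann_1", "doc_2", "neutral"),
    ("ann_2", "doc_2", "neutral"),
    ("ann_3", "doc_2", "neutral"),
]

task = AnnotationTask(data=triples)
print(task.alpha())  # Krippendorff's alpha (nominal distance by default)
print(task.kappa())  # average pairwise Cohen's kappa
```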

For RLHF and preference annotation, the quality framework does not apply in the same way. Agreement between preference annotators is expected to be lower for subjective tasks, and the diversity of preference signals is part of the training data, not noise to eliminate. Encord's metrics do not distinguish these cases.

Label Studio's quality measurement approach

Label Studio Enterprise surfaces IAA metrics including Krippendorff's alpha — a statistic that works across different annotation types, handles missing data, and is more appropriate for text and categorical annotation tasks than IoU-based measures.

Ground truth comparison, rework tracking, and annotator performance analytics are also available. For RLHF workflows, quality measurement is built into the preference collection interface: annotator responses produce comparable, aggregable preference signals rather than expecting a single ground truth resolution.
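As a rough illustration of what aggregable preference signals look like in practice — this is a generic sketch, not Label Studio's API — pairwise votes can be kept as per-prompt distributions instead of being collapsed into a single "correct" answer:

```python
from collections import Counter

# Hypothetical pairwise preference votes: (prompt_id, preferred_response)
votes = [
    ("prompt_7", "response_a"),
    ("prompt_7", "response_a"),
    ("prompt_7", "response_b"),
    ("prompt_9", "response_b"),
    ("prompt_9", "response_b"),
]

def preference_distribution(votes):
    """Per-prompt vote shares: disagreement is retained as signal,
    not resolved into a single ground truth."""
    by_prompt = {}
    for prompt, choice in votes:
        by_prompt.setdefault(prompt, Counter())[choice] += 1
    return {
        prompt: {c: n / sum(counts.values()) for c, n in counts.items()}
        for prompt, counts in by_prompt.items()
    }

# e.g. prompt_7 -> response_a ~0.67, response_b ~0.33; prompt_9 -> response_b 1.0
print(preference_distribution(votes))
```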

You can check out our in-depth comparison of Label Studio and Encord here, or talk to an expert at HumanSignal about annotation quality measurement for your program.

Frequently Asked Questions

What does IoU measure in annotation quality?

Intersection over Union measures geometric overlap between two annotations of the same object. A score of 1.0 means two annotators drew identical shapes. Lower scores indicate divergence in where they drew boundaries. Encord uses IoU as the primary agreement metric for bounding boxes and segmentation masks.

What does a low ground truth accuracy score indicate in Encord?

Low ground truth accuracy means annotators are producing labels that diverge from the verified reference standard. Combined with high IAA scores, it typically indicates systematic bias: annotators consistently agree with each other but are consistently wrong relative to the ground truth.

What is rework rate and why does it matter?

Rework rate measures how often tasks are sent back for correction after review. High rework rates on specific task types or from specific annotators signal quality problems that agreement metrics between annotators might not surface. It is a useful leading indicator of labeling guideline problems.

Does Encord support Krippendorff's alpha for text annotation?

No. Encord's IAA framework is built for geometric annotation and uses IoU as the primary measure. For text classification and NLP tasks, teams typically need to compute Krippendorff's alpha or similar statistics externally.

How should annotation quality metrics be interpreted together?

High IAA with high ground truth accuracy indicates consistently correct annotation. High IAA with low ground truth accuracy indicates systematic bias. Low IAA with low ground truth accuracy indicates both random error and guideline problems. Low IAA with high ground truth accuracy is rare but can indicate annotators are independently finding correct answers through different paths.
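One compact way to apply this grid, with illustrative thresholds rather than platform defaults:

```python
def interpret(iaa, gt_accuracy, iaa_bar=0.8, gt_bar=0.9):
    """Map the two signals onto the four diagnoses described above.
    The 0.8 and 0.9 cutoffs are illustrative, not Encord defaults."""
    if iaa >= iaa_bar and gt_accuracy >= gt_bar:
        return "consistently correct annotation"
    if iaa >= iaa_bar:
        return "systematic bias: agreement on the wrong answer"
    if gt_accuracy >= gt_bar:
        return "correct but inconsistent: check edge-case guidance"
    return "random error plus guideline problems"

print(interpret(iaa=0.92, gt_accuracy=0.72))  # systematic bias
```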

How does Label Studio handle quality metrics for preference annotation?

Label Studio's preference collection interfaces produce structured, comparable signals across annotators rather than expecting convergence on a single ground truth. Quality is assessed on the consistency and coverage of preference signals collected, not on agreement rates that would be inappropriate for subjective tasks.
