In the Loop: What is Annotator Agreement?

In this episode of In The Loop, ML Evangelist Micaela Kaplan explains annotator agreement—why it matters, how to measure it, and how it impacts data quality. Learn the basics of pairwise and aggregate agreement to evaluate annotator consistency and improve your labeling workflows.

Transcript

Hi, I’m Micaela Kaplan, the ML Evangelist at HumanSignal, and this is In The Loop—the series where we help you stay up to date on all things data science and AI.

This week, we’re starting a new video series on annotator agreement: what it is, why it matters, and how to use it to get high-quality data for training or evaluating models.

Let’s say you’re evaluating LLM outputs and want multiple annotators working in parallel for efficiency, bias reduction, and completeness. To trust the results, we need to ensure consistent, high-quality labels—even when there’s no clear “right” answer. That’s where annotator agreement comes in.

Human judgment can vary, especially on subjective tasks like emotion detection. Overlap helps address this: it defines how many annotators label each task and what portion of the dataset receives multiple annotations. By overlapping a subset of tasks, we can measure how consistent annotators are and use that as a proxy for label quality and annotator performance.

For example, in LLM evaluation, you might assign at least two annotators to 20% of the tasks. This balances quality and efficiency—especially on straightforward tasks. For subjective ones, like emotion classification, you'd want higher overlap on more tasks.
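To make that overlap plan concrete, here is a minimal Python sketch of one way to assign it. It is illustrative only: the function name `assign_overlap` and its parameters are hypothetical and not part of Label Studio's API, which has its own overlap settings.

```python
import random

def assign_overlap(task_ids, overlap_fraction=0.2, annotators_per_task=2, seed=0):
    """Pick a random subset of tasks to receive multiple annotators; the rest get one."""
    rng = random.Random(seed)
    n_overlap = int(len(task_ids) * overlap_fraction)
    overlap_ids = set(rng.sample(task_ids, n_overlap))
    return {
        task_id: (annotators_per_task if task_id in overlap_ids else 1)
        for task_id in task_ids
    }

# 100 tasks: roughly 20% get two annotators, the rest get one.
plan = assign_overlap(list(range(100)))
print(sum(1 for n in plan.values() if n == 2))  # 20
```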

With multiple annotations per task, we need a way to measure how similar the labels are. This is known as agreement. Agreement is an industry-standard proxy for annotation quality, and there are several inter-annotator agreement (IAA) metrics—also called inter-rater reliability metrics—that help quantify it.

Let’s look at two basic agreement types: pairwise and aggregate.

Pairwise agreement compares each annotator’s answer to every other annotator’s, scoring them as matches (1) or mismatches (0). You average the scores across all annotator pairs. This gives insight into annotator alignment and helps identify strong annotators—those who consistently agree with others—and weak ones who may need more training.

Example: Four annotators are choosing among three LLM responses (A, B, and C). With four annotators, there are six pairs to compare. If we compute pairwise agreement across all of those pairs and get an average score of 0.1667, only one pair agreed. This low score signals disagreement and a need to investigate. You might refine the labeling instructions or revisit ambiguous cases.
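Here is a minimal Python sketch of that calculation. The specific labels (A, A, B, C) are an assumption on our part, chosen as one distribution consistent with the 0.1667 score in the example:

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Average exact-match score (1 = match, 0 = mismatch) over all annotator pairs."""
    pairs = list(combinations(labels, 2))
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)

# Hypothetical labels for the example: only one of the six pairs agrees.
labels = ["A", "A", "B", "C"]
print(round(pairwise_agreement(labels), 4))  # 0.1667
```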

Aggregate agreement takes a majority-vote approach. Using the same example, if two annotators choose A, and the others pick B and C, the majority answer is A. Since only half the annotators selected it, the aggregate agreement is 50%. This doesn't reveal how individual annotators compare, but it gives a sense of consensus and confidence in the chosen label.
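A sketch of the majority-vote calculation, using the same assumed labels (A, A, B, C):

```python
from collections import Counter

def aggregate_agreement(labels):
    """Return the majority label and the share of annotators who chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

majority, score = aggregate_agreement(["A", "A", "B", "C"])
print(majority, score)  # A 0.5
```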

So which metric should you use? It depends on your goal. If you're selecting the most reliable label for training or evaluation, aggregate agreement works well. If you’re evaluating annotator performance or monitoring task ambiguity, pairwise agreement is more useful.

Annotator agreement metrics help ensure consistent, reliable, high-quality data from your human annotators—essential for model training and evaluation.

This week, we covered exact match agreement using both pairwise and aggregate methods. Next time, we’ll explore statistical agreement metrics that account for chance, like Cohen’s Kappa and Krippendorff’s Alpha.

Thanks for watching In The Loop. Check out our other videos, and don’t forget to like and subscribe to stay in the loop.
