In the Loop: Cohen’s and Fleiss’ Kappas
In this episode of In The Loop, ML Evangelist Micaela Kaplan walks through Cohen’s Kappa and Fleiss’ Kappa—two essential metrics for evaluating annotator reliability across a dataset. Learn how to calculate them and use the results to improve labeling quality.
Transcript
Hi, I’m Micaela Kaplan, the ML Evangelist at HumanSignal, and this is In The Loop—the series where we help you stay in the loop with all things data science and AI.
Last week, we discussed task-based agreement metrics for evaluating annotation quality on a per-task basis. This week, we’re covering two metrics that measure annotator reliability at the dataset or project level: Cohen’s Kappa and Fleiss’ Kappa.
Let’s start with Cohen’s Kappa, developed by Jacob Cohen in 1960. It measures the agreement between two raters classifying N items into C categories, accounting for agreement that could happen by chance. The formula is:
Kappa = (Observed Agreement – Expected Agreement) / (1 – Expected Agreement)
- Observed agreement is how often the raters agree.
- Expected agreement is how often we would expect them to agree by chance.
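To make this concrete, here’s a minimal Python sketch of the formula. The cohens_kappa helper below is our own illustration (not a specific library’s API); it takes a C × C table counting how often each pair of labels co-occurred between the two raters.

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's kappa from a C x C table where rows are rater 1's labels
    and columns are rater 2's labels."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()

    # Observed agreement: fraction of items where both raters chose the same class.
    observed = np.trace(confusion) / total

    # Expected agreement: chance that both raters pick the same class,
    # based on each rater's marginal label frequencies.
    rater1_marginals = confusion.sum(axis=1) / total
    rater2_marginals = confusion.sum(axis=0) / total
    expected = np.sum(rater1_marginals * rater2_marginals)

    return (observed - expected) / (1 - expected)
```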
Example:
Two annotators each label 50 samples.
- They both say “yes” on 15 samples and both say “no” on 12 samples.
- Observed agreement = (15 + 12) / 50 = 0.54
- Expected probability of both saying “yes”: one annotator says “yes” 25 times in total and the other says “yes” 28 times, so (25/50) × (28/50) = 0.28
- Expected probability of both saying “no” is calculated the same way from the “no” totals: (25/50) × (22/50) = 0.22
- Expected agreement = 0.28 + 0.22 = 0.50
- Kappa = (0.54 – 0.50) / (1 – 0.50) = 0.08, which is considered slight agreement
This result suggests that retraining the annotators could improve consistency.
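Plugging the example into the sketch above: the diagonal holds the 15 both-“yes” and 12 both-“no” counts, and the off-diagonal counts of 10 and 13 follow from the two annotators’ “yes” totals of 25 and 28.

```python
# Rows: annotator 1 (yes, no); columns: annotator 2 (yes, no).
table = [[15, 10],
         [13, 12]]
print(round(cohens_kappa(table), 2))  # 0.08 -- slight agreement
```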
Cohen’s Kappa works well for two annotators, but what if you have more? That’s where Fleiss’ Kappa comes in.
Named after Joseph Fleiss, Fleiss’ Kappa generalizes Cohen’s Kappa to scenarios with more than two annotators, each selecting from a fixed number of classes. The formula is the same structure, but the way we calculate observed and expected agreement differs.
Fleiss’ Kappa Example:
Suppose four annotators each label seven items.
- First, count how many times each class appears for each task.
- In total, “yes” appears 15 times and “no” 13 times (28 annotations total).
- Class probabilities:
- Yes = 15 / 28 = 0.536
- No = 13 / 28 = 0.464
- Expected agreement = 0.536² + 0.464² = 0.503
To calculate observed agreement:
- Use the formula based on the number of tasks (N = 7) and annotators per task (n = 4).
- For each task, square the count of each class, then sum those squared counts across all tasks.
- Sum = 86
- Observed agreement = (86 – 28) / [7 × 4 × (4 – 1)] = 58 / 84 ≈ 0.690, where 28 is the total number of annotations (N × n).
Final Fleiss’ Kappa = (0.690 – 0.503) / (1 – 0.503) ≈ 0.38, which is considered fair agreement.
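Here is a matching Python sketch, again as an illustrative helper rather than a specific library’s API. The example only gives aggregate counts, so the per-item (yes, no) counts below are one hypothetical distribution that is consistent with them: 15 “yes”, 13 “no”, and squared counts summing to 86.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an N x C matrix where counts[i, j] is the number
    of annotators who assigned item i to class j."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()  # assumes every item received the same number of ratings

    # Expected agreement from the overall class proportions.
    class_probs = counts.sum(axis=0) / counts.sum()
    expected = np.sum(class_probs ** 2)

    # Observed agreement: average pairwise agreement within each item.
    observed = (np.sum(counts ** 2) - n_items * n_raters) / (
        n_items * n_raters * (n_raters - 1)
    )

    return (observed - expected) / (1 - expected)

# Hypothetical per-item (yes, no) counts consistent with the example's totals.
counts = [[4, 0], [4, 0], [3, 1], [2, 2], [1, 3], [1, 3], [0, 4]]
print(round(fleiss_kappa(counts), 2))  # ~0.38 -- fair agreement
```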
In both examples, the metrics help you assess whether annotators are consistent beyond what chance alone would produce. A score of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance. These scores can help guide annotator training and quality control.
These metrics are most useful when annotating categorical data, like multiple-choice classification tasks. While today’s examples used binary classes, both metrics work with more categories.
What if your data isn’t categorical? Or if you have missing annotations?
In the next episode of In The Loop, we’ll wrap up this series with Krippendorff’s Alpha—a more flexible agreement metric that handles missing data and different data types.
That’s it for this week. Thanks for staying in the loop. Check out our other videos, and don’t forget to like and subscribe to stay in the loop.