
In the Loop: Krippendorff's Alpha

In this episode of In The Loop, Micaela Kaplan breaks down Krippendorff’s Alpha—a flexible agreement metric for evaluating annotator consistency, even with incomplete or non-categorical data.

Transcript

Hi, I’m Micaela Kaplan, the ML Evangelist at HumanSignal, and this is In The Loop—the series where we help you stay in the loop with all things data science and AI.

This week, we’re wrapping up our annotator agreement series with Krippendorff’s Alpha.

Krippendorff’s Alpha, like Cohen’s and Fleiss’ Kappa, is a dataset-level metric for assessing annotator reliability. The key difference is flexibility. Krippendorff’s Alpha supports:

  • Any number of raters
  • Incomplete data
  • Multiple data types, including ordinal, nominal, and interval ratings

The formula is familiar:

Alpha = (Observed Agreement – Expected Agreement) / (1 – Expected Agreement)

What makes Krippendorff’s Alpha different is how we compute each term. Let’s walk through an example.

Example Setup

You have 4 raters evaluating 10 items on a 1–5 scale. First, remove any items with only one rating—Krippendorff’s Alpha requires at least two ratings per item. In this case, we remove item 10, leaving 9 items.

Define:

  • n = number of items = 9
  • q = number of rating categories = 5

Next, create a table of how often each score (1–5) was assigned per item. This is similar to the Fleiss’ Kappa table, but now with a 1–5 scale. You also compute:

  • Total ratings per item (rᵢ)
  • Average number of raters per item (r̄)
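
If you want to follow along in code, here is a minimal Python/NumPy sketch of this setup. The ratings below are made up for illustration (the episode's actual table isn't reproduced here): the sketch drops the single-rating item, builds the per-item count table, and computes rᵢ and r̄.

```python
import numpy as np

# Hypothetical ratings: 4 raters x 10 items on a 1-5 scale.
# np.nan marks a missing rating; item 10 has only one rating here.
ratings = np.array([
    [3, 4, 2, 5, 1, 3, 4, 2, 5, 3],
    [3, 4, 2, 5, 2, 3, 4, 1, 5, np.nan],
    [2, 4, 3, 5, 3, 3, 5, 2, 4, np.nan],
    [3, 5, 2, 4, 1, 2, 4, 2, 5, np.nan],
])

# Keep only items with at least two ratings (this drops item 10).
keep = (~np.isnan(ratings)).sum(axis=0) >= 2
ratings = ratings[:, keep]

n_items = ratings.shape[1]      # n = 9
categories = [1, 2, 3, 4, 5]    # q = 5

# Count table: one row per item, one column per score,
# each cell = how many raters gave that item that score.
counts = np.array([[(ratings[:, i] == k).sum() for k in categories]
                   for i in range(n_items)])

r_i = counts.sum(axis=1)        # total ratings per item
r_bar = r_i.mean()              # average number of raters per item
```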

Choosing a Weighting Scheme

Before calculating agreement, you must define what it means for two ratings to “agree.” This depends on how you treat your scale:

  • Nominal: No inherent order (e.g. colors, categories). Agreement = exact match.
  • Ordinal: Ordered but unevenly spaced (e.g. rankings, sentiment). Agreement gives partial credit based on rank distance.
  • Interval: Ordered with equal spacing (e.g. rating scales). Agreement gives partial credit based on the squared difference between scores.

For this example, we’ll use nominal agreement, meaning raters only agree if their scores match exactly.
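
To make "agreement weights" concrete, here is a rough NumPy sketch (not the episode's notation): nominal weighting is an identity matrix, so raters get credit only for exact matches, while interval weighting gives partial credit that falls off with the squared distance between scores. Ordinal weights depend on the ranking structure of the scale, so they are left out of this sketch.

```python
import numpy as np

def weight_matrix(categories, level="nominal"):
    """Agreement weight between every pair of rating categories."""
    c = np.asarray(categories, dtype=float)
    if level == "nominal":
        # Exact match or nothing.
        return np.eye(len(c))
    if level == "interval":
        # Partial credit: 1 for identical scores, 0 for the widest gap.
        diff2 = (c[:, None] - c[None, :]) ** 2
        return 1.0 - diff2 / diff2.max()
    raise ValueError(f"unsupported level: {level}")

# Nominal weights for the 1-5 scale used in this example.
print(weight_matrix([1, 2, 3, 4, 5], "nominal"))
```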

Calculating Observed Agreement

We compute the observed agreement by averaging the agreement scores across items. For each item:

  • Sum the pairwise agreement scores using the formula shown in the episode
  • Normalize by the number of raters and rating pairs
  • Average the per-item agreement scores across all items

In our example, this gives a final observed agreement of 0.345.
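
Here is a small, self-contained sketch of that "per item, then average" idea, using the same made-up count table as above and plain pairwise agreement with nominal weights. The episode's on-screen formula also normalizes by the average number of raters, and its 0.345 comes from its own data, so these hypothetical numbers won't match it.

```python
import numpy as np

# Hypothetical count table: 9 items x 5 scores (same made-up data as above),
# each cell = how many raters gave that item that score.
counts = np.array([
    [0, 1, 3, 0, 0],
    [0, 0, 0, 3, 1],
    [0, 3, 1, 0, 0],
    [0, 0, 0, 1, 3],
    [2, 1, 1, 0, 0],
    [0, 1, 3, 0, 0],
    [0, 0, 0, 3, 1],
    [1, 3, 0, 0, 0],
    [0, 0, 0, 1, 3],
])
r_i = counts.sum(axis=1)   # ratings per item (4 each here)

# Fraction of rater pairs on each item that picked exactly the same score
# (nominal weighting), then the average across items.
pairwise = (counts * (counts - 1)).sum(axis=1) / (r_i * (r_i - 1))
observed_agreement = pairwise.mean()
print(round(observed_agreement, 3))
```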

Calculating Expected Agreement

Expected agreement is based on the likelihood of two raters agreeing by chance:

  • Count total times each score (1–5) was selected
  • Convert these counts into probabilities (πₖ)
  • Since we’re using nominal weighting, we square the probabilities and sum them

This gives an expected agreement of 0.22.
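
With the same hypothetical count table, those three steps look like this in code (the 0.22 above comes from the episode's own data, not from these made-up totals):

```python
import numpy as np

# How many times each score 1-5 was selected overall
# (the column sums of the hypothetical count table above).
score_totals = np.array([3, 9, 8, 8, 8])

# Convert counts to probabilities (pi_k).
pi_k = score_totals / score_totals.sum()

# Nominal weighting: chance agreement is the sum of squared probabilities.
expected_agreement = (pi_k ** 2).sum()
print(round(expected_agreement, 3))
```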

Final Krippendorff’s Alpha

Plugging in the values:

Alpha = (0.345 – 0.22) / (1 – 0.22) = 0.16
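
As a quick check of the arithmetic:

```python
observed, expected = 0.345, 0.22
alpha = (observed - expected) / (1 - expected)
print(round(alpha, 2))  # 0.16
```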

Krippendorff’s Alpha ranges from –1 to 1:

  • 1 = perfect agreement
  • 0 = agreement by chance
  • < 0 = systematic disagreement

A score of 0.16 indicates low agreement—better than random, but with room for improvement. This might suggest the need for better annotator training or clearer labeling instructions.

While Krippendorff’s Alpha is mathematically more complex than other metrics, it’s also more powerful. The good news is there are open-source libraries and online tools to help you calculate it. Understanding how it works helps you choose the right agreement metric for your task.
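
One widely used option is the open-source krippendorff package on PyPI. The sketch below assumes its reliability_data / level_of_measurement interface and reuses the made-up raters-by-items matrix from earlier, with np.nan marking missing ratings:

```python
# pip install krippendorff
import numpy as np
import krippendorff

# One row per rater, one column per item; np.nan = no rating.
ratings = np.array([
    [3, 4, 2, 5, 1, 3, 4, 2, 5, 3],
    [3, 4, 2, 5, 2, 3, 4, 1, 5, np.nan],
    [2, 4, 3, 5, 3, 3, 5, 2, 4, np.nan],
    [3, 5, 2, 4, 1, 2, 4, 2, 5, np.nan],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(round(alpha, 3))
```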

That’s it for this week. Thanks for staying in the loop.

Want more episodes? Check out our other videos. Don’t forget to like and subscribe to stay in the loop.
