The Sweet Science of Subjective Evaluation
At PyTorch Conference Europe 2026, we ran a simple experiment. We handed attendees two pieces of French chocolate and asked them to evaluate what they tasted. On paper, the task seemed easy: choose a few flavor notes, rate how well the official description matched the sample, and move on. In practice, it turned into something more interesting. The exercise became a real test of how people interpret instructions, how subjective tasks behave at scale, and where data quality starts to break down when there is little or no time to train annotators.
The task
Participants completed two parts for each chocolate sample.
First, they selected flavor notes from a multiple-choice list. Options included labels like red berry, dark fruit, nutty, earthy, and bitter. They were not given a glossary or reference guide. Instead, they had to interpret those flavor labels on their own based on what they tasted.
Second, they read the official description of the chocolate they had just tasted and rated how accurate it felt on a scale from 1 to 10. In the interface, that score appeared as a draggable point on the scale, with the point defaulting to 5.
What we found
Once we reviewed the results, one pattern stood out right away: agreement was low for both samples. Overall consensus agreement landed at about 26% for each sample, which confirmed what we suspected from the start: this was a highly subjective evaluation.
A more detailed breakdown made the source of that disagreement clearer. The overall agreement score reflected both parts of the task, but the two questions behaved very differently. Accuracy ratings showed much stronger agreement, at about 45% for one sample and about 41% for the other. Flavor selection agreement was much lower, at roughly 8% for one sample and about 12% for the other. Most of the disagreement came from the flavor-labeling portion, not the rating task.
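As a rough illustration of how a breakdown like this can be computed, here is a minimal Python sketch rather than the exact metric our interface uses. The record fields (sample, flavor_notes, accuracy_rating), the within-one-point tolerance for rating agreement, and the Jaccard overlap for flavor sets are all assumptions made for the example.

```python
from itertools import combinations

# Toy annotation records; field names and values are illustrative only.
annotations = [
    {"sample": "A", "flavor_notes": {"dark fruit", "bitter"}, "accuracy_rating": 7},
    {"sample": "A", "flavor_notes": {"red berry", "bitter"}, "accuracy_rating": 6},
    {"sample": "A", "flavor_notes": {"caramel", "hazelnut"}, "accuracy_rating": 7},
    {"sample": "B", "flavor_notes": {"caramel", "cream"}, "accuracy_rating": 8},
    {"sample": "B", "flavor_notes": {"caramel", "hazelnut"}, "accuracy_rating": 8},
    {"sample": "B", "flavor_notes": {"bright fruit"}, "accuracy_rating": 5},
]

def pairwise_rating_agreement(records, tolerance=1):
    """Fraction of annotator pairs whose ratings fall within `tolerance` points."""
    pairs = list(combinations(records, 2))
    agree = sum(
        abs(a["accuracy_rating"] - b["accuracy_rating"]) <= tolerance
        for a, b in pairs
    )
    return agree / len(pairs)

def pairwise_flavor_agreement(records):
    """Mean Jaccard overlap between the flavor sets chosen by each pair of annotators."""
    pairs = list(combinations(records, 2))
    overlaps = [
        len(a["flavor_notes"] & b["flavor_notes"]) / len(a["flavor_notes"] | b["flavor_notes"])
        for a, b in pairs
    ]
    return sum(overlaps) / len(overlaps)

for sample in ("A", "B"):
    records = [r for r in annotations if r["sample"] == sample]
    print(
        f"Sample {sample}: "
        f"rating agreement={pairwise_rating_agreement(records):.2f}, "
        f"flavor agreement={pairwise_flavor_agreement(records):.2f}"
    )
```

One caveat on this style of metric: pairwise comparisons grow quadratically with the number of annotators per item, which is perfectly manageable at conference-demo scale but worth keeping in mind for larger pools.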
The flavor patterns themselves were still meaningful. For Sample A, the most commonly selected notes were Dark Fruit, Bitter, and Red Berry. For Sample B, the most common were Caramel, Cream, and Hazelnut. That suggests people were not answering randomly. At the same time, we also saw crossover labels. Some participants selected caramel and hazelnut for Sample A, while others chose bright fruit for Sample B. That noise helps explain why agreement remained low.
The challenges
Even a simple chocolate-tasting task surfaced a few familiar annotation problems.
Limited training time created confusion. On a conference floor, people move fast. The task had to be intuitive enough to complete with almost no explanation. For the most part, it was. Still, some participants interpreted flavor labels differently, and others misunderstood the rating question, thinking they themselves were being judged rather than the chocolate's description.
Small setup details introduced noise. We also saw how easily the physical setup can affect data quality. Because the samples were arranged differently than some participants expected, a few people mixed up Sample A and Sample B. Even with images in the interface, that mismatch between the real-world setup and the digital workflow added noise to the data.
Subjectivity became more visible at scale. Chocolate tasting is inherently subjective. Without shared definitions or training, people naturally interpret the same sample in different ways. That variation is not an exception. It is part of the task itself. Once participation scaled up, the need for a workflow that was quick, stable, and easy to complete with minimal guidance became even more obvious.
What we learned
Task design matters most when training time is limited. When people have very little guidance, the task itself has to do more of the work. Instructions need to be clear. The order of operations should match what people intuitively expect. The interface should answer likely questions before they create confusion. If we ran this experiment again, we would reorder parts of the task, add clickable definitions for flavor categories, and improve the question introductions.
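To make the idea of clickable definitions concrete, here is a hypothetical task configuration. The schema, field names, and flavor definitions are invented for illustration and do not correspond to a specific annotation tool.

```python
# Hypothetical question configs showing how per-option definitions could be
# attached to flavor choices; the schema is an assumption for illustration.
flavor_question = {
    "type": "multiple_select",
    "prompt": "Which flavor notes do you taste?",
    "options": [
        {"label": "Red berry", "definition": "Bright, tart fruit such as raspberry or currant."},
        {"label": "Dark fruit", "definition": "Deeper, jammy fruit such as fig, plum, or raisin."},
        {"label": "Nutty", "definition": "Roasted notes such as hazelnut or almond."},
        {"label": "Earthy", "definition": "Soil, mushroom, or woody character."},
        {"label": "Bitter", "definition": "Cocoa bitterness, similar to strong dark coffee."},
    ],
}

rating_question = {
    "type": "slider",
    "prompt": "How accurately does the official description match what you tasted?",
    "scale": {"min": 1, "max": 10},
    # Leaving the slider unset avoids anchoring every response at a default value.
    "default": None,
}
```

Even small choices like an unset slider default or a one-line definition behind each label can remove a surprising amount of the ambiguity we saw on the floor.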
Subjective tasks need stronger agreement signals. When a task depends on personal judgment, variation is inevitable. That makes agreement metrics especially useful. In this case, the large number of participants made it easier to spot response patterns, but that kind of scale is not always practical in real workflows. Agreement metrics help show where annotators are converging, where they are diverging, and where additional review may be needed.
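In practice, the simplest way to act on those signals is to route low-agreement items to an extra review pass. The sketch below assumes per-item agreement scores have already been computed (for example, with the functions shown earlier); the 0.5 threshold and the item names are placeholders, not recommended values.

```python
# Illustrative per-item agreement scores; the keys and the threshold are assumptions.
item_agreement = {
    "sample_A_flavors": 0.08,
    "sample_A_rating": 0.45,
    "sample_B_flavors": 0.12,
    "sample_B_rating": 0.41,
}

REVIEW_THRESHOLD = 0.5  # arbitrary cutoff chosen for the example

# Items below the threshold go back for another look instead of straight into the dataset.
needs_review = [item for item, score in item_agreement.items() if score < REVIEW_THRESHOLD]
print("Queue for additional review:", needs_review)
```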
People need help identifying nuance. One of the clearest lessons from the conference floor was that most people do not naturally describe chocolate in detailed flavor terms. The most common reaction we heard was simply that the samples “tasted like chocolate.” That does not make the exercise a failure. It shows that nuanced evaluation depends on context, examples, and shared definitions, especially when participants are not trained tasters.
From taste to takeaways
At its core, this experiment points to something larger about modern model evaluation. No matter how sophisticated our metrics or benchmarks become, there is always an element of subjectivity that cannot be fully abstracted away. Whether people are judging the quality of a model response, the tone of an output, or how well something aligns with human expectations, those decisions still rely on human interpretation. Just as no two people experience flavor in exactly the same way, annotators bring their own context, preferences, and biases into evaluation tasks. The challenge is not to remove subjectivity completely. The challenge is to structure it well enough that it becomes useful.
That is where evaluation workflows matter. The goal is to capture rich human feedback, measure agreement, surface divergence, and turn subjective signals into something teams can actually learn from.