How to use LLM-as-judge for agent evaluation with Label Studio
Many teams setting up an LLM judge for agent evaluation find that scores shift between runs, that no one can explain why a trace passed or failed, and that the judge may be measuring something other than what matters. Recent industry research shows that 93 percent of teams struggle to implement LLM judges: inconsistent scoring, runaway costs, and bias they cannot locate. The failure stems from workflow gaps, not model quality. The judge was deployed without rubrics, without a ground truth set, and without any way to catch the cases where it was wrong. A four-step calibration workflow (rubrics, ground truth, calibration run, and disagreement review) turns judge output into a signal you can act on.
TL;DR
93 percent of teams struggle to implement LLM judges: inconsistent scoring, runaway costs, and bias they can't locate.
Single-pass judges evaluate final outputs, missing tool call errors and trajectory failures.
A calibrated workflow needs four components: rubrics, ground truth, calibration, and disagreement review.
Agent-specific rubrics must cover tool selection, error recovery, and unnecessary loops.
Aim for 75–90 percent judge-to-human agreement before running against production data.
Why single-pass LLM judges miss agent failures
A single LLM judge reading a final agent response cannot see what the agent did to produce it. It sees the text. It does not see whether the agent called the right tool, looped unnecessarily, or failed to recover from a tool error. A 2026 survey on agent evaluation named this the defining limitation of single-pass evaluation. Judges suffer from cognitive overload on multi-step trajectories and hallucinate their scores because they can't verify how agent actions affect the real world.
Wasteful trajectories, wrong tool selection, and accidental error recovery can all sit underneath a coherent response. A judge scoring only the output can't tell you whether that path will recur at scale.
The four building blocks of a calibrated judge workflow
Plugging a prompt into GPT-4o and pointing it at agent outputs is the first step of every team's evaluation story. It is rarely the last, because uncalibrated judges compound the problem they were meant to solve. A multi-agent meta-judge study (arXiv, Apr 2025) shows that rubric design and multi-agent scoring improve accuracy. These methods increased judgment accuracy by 15.55 percent over raw LLM scores. Structure drives that improvement.
HumanSignal's scaling evaluation for agent workflows framework defines four components that produce a judge you can trust.
Rubrics
A rubric translates "did the agent do a good job?" into scorable, specific dimensions. Each dimension names a behavior and defines passing and failing criteria. Generic rubrics about helpfulness and tone miss agent-specific behavior. Rubrics need to cover the trajectory.
Ground truth sets
A ground truth set is a curated collection of agent traces that a domain expert has already scored against the rubric. It anchors every automated run. Without it, the judge has no fixed reference point, and scores drift with prompt phrasing, token order, and stochasticity.
Calibration
Calibration means running the LLM judge against the ground truth set and measuring where it agrees and disagrees with the expert labels, dimension by dimension. The calibration run reveals where the judge is reliable and where it diverges from the expert. Those gaps aren't failures; they're the data you need to improve the judge.
Disagreement review
Disagreement review takes the cases where the judge and the expert disagree, adjudicates them, and uses the outcome to refine the rubric or judge prompt. It closes the feedback loop, turning calibration from a diagnostic into a treatment.
Write agent-specific rubrics, then build your ground truth set
Generic evaluation dimensions do not expose agent-specific failures. "Helpfulness" does not tell you whether the agent chose the correct tool for the task. "Tone" does not tell you whether the agent recovered from a failed API call or silently continued on bad data. Agent rubrics need to cover the trajectory directly.
Rubric dimensions for agent traces
Agent-specific rubrics need at least three dimensions that standard LLM rubrics leave out:
Tool selection accuracy: Did the agent invoke the tool appropriate for this step, or did it default to a general-purpose call when a more specific one was available? Pass/fail per step.
Error recovery: When a tool returned an error or null result, did the agent re-plan appropriately? A "pass" requires evidence of re-planning, not just the absence of a crash.
Loop detection: Did the agent revisit the same retrieval or reasoning step more than once without a state change that justified it? Flag as a loop if the same tool fires twice on equivalent inputs.
Binary or ternary judgments work better than 1-5 scales for these dimensions. Hamel Husain argues that tracking 1-5 scores without defining what separates a 3 from a 4 produces noisy data no one can act on. A binary pass/fail anchored to a specific criterion gives you a number with a cause.
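The loop-detection criterion is mechanical enough to pre-screen in code before a judge or reviewer sees the trace. The following is a minimal Python sketch, assuming a hypothetical `ToolCall` record with `step`, `tool`, and `arguments` fields (none of these names come from Label Studio). It treats two calls as equivalent when the tool name and the canonicalized arguments match. It does not model the state changes that might justify a repeat, so flagged pairs are candidates for review rather than final verdicts.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """One step in an agent trace (hypothetical shape, for illustration only)."""
    step: int
    tool: str
    arguments: dict[str, Any]


def find_loops(calls: list[ToolCall]) -> list[tuple[int, int]]:
    """Return (first_step, repeat_step) pairs where a tool fired again on equivalent inputs."""
    seen: dict[tuple[str, str], int] = {}
    loops: list[tuple[int, int]] = []
    for call in calls:
        # Canonical JSON makes argument order irrelevant when comparing inputs.
        key = (call.tool, json.dumps(call.arguments, sort_keys=True))
        if key in seen:
            loops.append((seen[key], call.step))
        else:
            seen[key] = call.step
    return loops


trace = [
    ToolCall(1, "search_docs", {"query": "refund policy", "top_k": 5}),
    ToolCall(2, "get_order", {"order_id": "A-17"}),
    ToolCall(3, "search_docs", {"top_k": 5, "query": "refund policy"}),
]
print(find_loops(trace))  # [(1, 3)]
```

An empty list maps to a binary "pass" on the loop dimension; a non-empty list gives the reviewer the exact steps to inspect.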
Building the ground truth set in Label Studio
With Label Studio's agentic traces modality, you can examine step-by-step reasoning and tool calls instead of just the final response. The multi-turn chat evaluation template supports the full trajectory across turns, so you can mark failures at the step where they occurred rather than inferring them from the output.
If your agent runs through an observability layer, you can import those traces directly into Label Studio using the agent trace integration. This pulls real production traces into the evaluation workspace without manual export steps. Have a domain expert score 50 to 100 traces against your rubric. The labeled collection becomes the ground truth anchor for every calibration run that follows.
Calibrate the judge against your ground truth set
Most teams skip calibration because it feels like overhead, but it determines whether the rest of the workflow produces usable data.
Choose the right judge model
Judge model selection has a measurable effect on score reliability before you write a single rubric. A 2025 IJCNLP study shows that judge model choice drives positional bias. This factor outweighs task complexity, output length, or quality gaps between candidates. NeurIPS 2024 research found a direct correlation: models score their own outputs higher when they can recognize them. A GPT-based agent judged by GPT-4o will receive inflated scores for stylistic patterns the judge itself generates. Use a different model family for the judge.
Run the calibration and read the agreement scores
With the judge model selected and the judge prompt drafted, run it against your labeled ground truth set using Label Studio's Prompts feature. Prompts runs LLM-as-judge scoring at scale and compares automated scores against your human benchmarks, all without a separate export step.
After the run, check agreement per rubric dimension using Label Studio's per-label agreement metrics. Agreement across the full rubric masks the variance that matters. A judge can agree with experts 80 percent of the time overall while diverging on "error recovery" 40 percent of the time. The dimension-level view shows exactly where the judge is unreliable.
Aim for 75 to 90 percent agreement with expert labels before running the judge on unlabeled data. Below 75 percent, the judge introduces more noise than it removes. Above 90 percent, check whether the rubric is too coarse. Binary pass/fail on trivial dimensions is easy to agree on and tells you little.
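To make the per-dimension check concrete, here is a small Python sketch that computes plain percent agreement outside of Label Studio. The label dictionaries, dimension names, and the `classify` helper are hypothetical; the 0.75 and 0.90 bounds come from the guidance above. Label Studio's own agreement metrics may be computed differently, so treat this as an illustration of the arithmetic only.

```python
from collections import defaultdict

# Hypothetical labels: trace_id -> {rubric dimension: verdict}
expert = {
    "t1": {"tool_selection": "pass", "error_recovery": "pass"},
    "t2": {"tool_selection": "pass", "error_recovery": "fail"},
    "t3": {"tool_selection": "fail", "error_recovery": "fail"},
    "t4": {"tool_selection": "pass", "error_recovery": "pass"},
}
judge = {
    "t1": {"tool_selection": "pass", "error_recovery": "pass"},
    "t2": {"tool_selection": "pass", "error_recovery": "pass"},
    "t3": {"tool_selection": "fail", "error_recovery": "pass"},
    "t4": {"tool_selection": "pass", "error_recovery": "pass"},
}


def agreement_by_dimension(expert_labels, judge_labels):
    """Fraction of traces where judge and expert gave the same verdict, per dimension."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for trace_id, dims in expert_labels.items():
        judged = judge_labels.get(trace_id, {})
        for dim, expert_verdict in dims.items():
            if dim not in judged:
                continue  # skip dimensions the judge did not score
            totals[dim] += 1
            matches[dim] += int(judged[dim] == expert_verdict)
    return {dim: matches[dim] / totals[dim] for dim in totals}


def classify(rate, low=0.75, high=0.90):
    """Map an agreement rate onto the 75 to 90 percent guidance."""
    if rate < low:
        return "too noisy"
    if rate > high:
        return "check rubric coarseness"
    return "usable"


rates = agreement_by_dimension(expert, judge)
overall = sum(rates.values()) / len(rates)  # valid here because every dimension has 4 traces
for dim, rate in rates.items():
    print(dim, rate, classify(rate))
print("overall", overall, classify(overall))
```

In this toy data the overall rate is 0.75 and looks usable, while `error_recovery` sits at 0.5. That is the kind of masking the dimension-level view is meant to expose.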
When to skip the LLM judge entirely
A calibrated LLM judge adds latency and cost to every evaluation it runs. When agent outputs can be checked deterministically (fixed-format JSON, SQL, compiler-checked code), run the external verifier instead; a language model reasoning about correctness is slower and more expensive than a check that executes. When a tool can confirm the agent's action directly, skip the judge. The calibration workflow here applies to agent outputs that are non-deterministic or use natural language, where no ground-truth oracle exists.
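As an illustration of the verifier path, the sketch below checks a fixed-format JSON output using only Python's standard library. The `REQUIRED_FIELDS` schema and the field names are hypothetical; a real pipeline would substitute its own schema or a dedicated validation library.

```python
import json

# Hypothetical schema: field name -> expected Python type after parsing.
REQUIRED_FIELDS = {"order_id": str, "items": list}


def verify_json_output(raw: str) -> tuple[bool, str]:
    """Deterministically check an agent's JSON output; no LLM call involved."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(payload, dict):
        return False, "top-level value is not an object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            return False, f"missing field: {field}"
        if not isinstance(payload[field], expected_type):
            return False, f"wrong type for field: {field}"
    return True, "ok"


print(verify_json_output('{"order_id": "A-17", "items": []}'))  # (True, 'ok')
print(verify_json_output('{"order_id": "A-17"}'))               # (False, 'missing field: items')
```

A check like this returns the same answer on every run and names the reason for a failure, which is the property the judge has to earn through calibration.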
Route disagreements to human review and feed scores back to the agent
In production, a calibrated judge will encounter traces it handles badly: low-confidence scores, rubric-boundary outputs, and adversarial inputs that shift its behavior. Prompt injections in agent outputs let the evaluand influence its own score, according to a 2026 security survey. Routing these cases to humans isn't optional overhead; it's the control layer that keeps your scoring pipeline trustworthy.
Setting up disagreement review in Label Studio
Label Studio's hybrid evaluation mode queues traces for expert review when scores fall below a confidence threshold or a secondary check disagrees. The expert sees the same trace the judge saw, makes an adjudicated call, and records a label.
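The routing rule described above can be expressed in a few lines. This is a sketch of the decision logic only, not Label Studio's implementation: the `JudgeResult` fields, the 0.7 default threshold, and the way confidence is reported are all assumptions to adapt to your own judge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeResult:
    """One judge verdict for one trace (hypothetical shape)."""
    trace_id: str
    verdict: str                              # "pass" or "fail"
    confidence: float                         # 0.0 to 1.0, as reported by your judge (assumption)
    secondary_verdict: Optional[str] = None   # verdict from a second check, if one ran


def needs_human_review(result: JudgeResult, threshold: float = 0.7) -> bool:
    """Queue a trace for expert review on low confidence or a disagreeing secondary check."""
    if result.confidence < threshold:
        return True
    if result.secondary_verdict is not None and result.secondary_verdict != result.verdict:
        return True
    return False


results = [
    JudgeResult("t1", "pass", 0.95, "pass"),
    JudgeResult("t2", "pass", 0.55),
    JudgeResult("t3", "fail", 0.90, "pass"),
]
review_queue = [r.trace_id for r in results if needs_human_review(r)]
print(review_queue)  # ['t2', 't3']
```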
Adjudicated labels close gaps in the ground truth set and flag rubric items that produce consistent disagreement. Traces the judge handled poorly were likely underrepresented in the original labeled sample, so adding them improves the next calibration run. When one rubric item keeps producing disagreement, the cause is often ambiguous rubric language, a specification problem that the rubric itself must resolve.
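One way to picture this feedback step is a small merge routine: adjudicated labels are folded into the ground truth set, and overturned judge verdicts are counted per rubric dimension so that recurring trouble spots stand out. The record layout below is hypothetical and stands in for whatever export format your review workflow produces.

```python
import copy
from collections import Counter


def fold_in_adjudications(ground_truth, adjudications):
    """Return (updated ground truth, count of overturned judge verdicts per dimension)."""
    updated = copy.deepcopy(ground_truth)  # leave the original set untouched
    overturned = Counter()
    for item in adjudications:
        # The expert's adjudicated call becomes the reference label for this trace.
        updated.setdefault(item["trace_id"], {})[item["dimension"]] = item["expert_verdict"]
        if item["judge_verdict"] != item["expert_verdict"]:
            overturned[item["dimension"]] += 1
    return updated, overturned


ground_truth = {"t1": {"error_recovery": "pass"}}
adjudications = [
    {"trace_id": "t9", "dimension": "error_recovery", "judge_verdict": "pass", "expert_verdict": "fail"},
    {"trace_id": "t10", "dimension": "error_recovery", "judge_verdict": "pass", "expert_verdict": "fail"},
    {"trace_id": "t10", "dimension": "tool_selection", "judge_verdict": "pass", "expert_verdict": "pass"},
]
updated, overturned = fold_in_adjudications(ground_truth, adjudications)
print(sorted(updated))        # ['t1', 't10', 't9']
print(dict(overturned))       # {'error_recovery': 2}
```

A dimension that dominates the overturned count is a reasonable first place to look for unclear rubric wording.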
Closing the loop to the agent
Industry data shows that teams using a multi-judge consensus approach (combining LLM judges with human-designed rubrics) reach 97 to 98 percent accuracy. Either method alone sits at roughly 80 to 85 percent. The feedback loop produces that accuracy gain.
Label Studio's Prompts workflow supports iterating on judge prompts and agent prompts using the same labeled data. When disagreement review surfaces a repeated tool selection error, that finding informs a change to the agent's system prompt or fine-tuning dataset. The next evaluation run shows whether the fix held.
When evaluation stops being a cost center
Most teams that hit the 93 percent wall assumed the problem was the judge model. A better model didn't fix inconsistent scores because the inconsistency was in the workflow. Unverified judge output is a vibe check with a higher invoice. Calibrated output, grounded in expert labels and routed through human review when it diverges, produces a score you can read, explain, and defend.
The more durable gain comes after the loop closes. When adjudicated disagreements update the ground truth set and drive the next prompt revision, you stop just measuring and start improving the agent. HumanSignal's evals workflow gives you the rubric, calibration, and disagreement-review tools to build that loop on your own agent traces. Evaluation becomes a development practice, and scores tell you what changed, what held, and what to fix next.
How many ground truth examples do I need for calibration?
Most teams start with 50 to 100 expert-labeled traces to establish a statistically meaningful baseline. This sample size is large enough to surface common judge biases, such as favoring assertive language over accuracy, while remaining small enough for a domain expert to label manually in a single session.
What is the ideal agreement score between a judge and an expert?
Aim for 75 to 90 percent agreement before deploying a judge to production data. If agreement is below 75 percent, the judge is too noisy to trust; if it exceeds 90 percent, your rubric may be too simple to capture the nuanced failures that actually occur in complex agent trajectories.
How do I prevent a judge from favoring its own outputs?
Use a different model family for the judge than you use for the agent to mitigate self-preference bias. Research from NeurIPS 2024 shows a linear correlation between a model's ability to recognize its own stylistic patterns and its tendency to assign them higher scores.
Can an agent manipulate its own evaluation score?
Yes, agent outputs can contain prompt injections designed to hijack the judge's instructions and force a passing score. A 2026 security survey identifies this as a primary risk for automated pipelines, which is why a disagreement review workflow is needed to catch and flag adversarial patterns.
When should I use code-based evaluators instead of an LLM judge?
Use deterministic code-based checks for any output with a fixed schema, such as JSON, SQL, or compiled code. An LLM judge adds unnecessary latency and cost when a tool can verify the result directly; reserve the judge for non-deterministic reasoning and natural language responses where no ground-truth oracle exists.