How to build an agent evaluation dataset
Your agent looked perfect in staging. It completed every demo task, returned clean outputs, and passed every test you ran. Then it went to production and sent an email to the wrong person. Research from the WorkBench project found that GPT-4, running as a ReAct agent, completed only 43 percent of realistic workplace tasks correctly. Those errors were not abstract scoring misses. They produced real wrong actions with real consequences. The gap between sandbox performance and production behavior often comes down to the dataset, not the tuning.
TL;DR
Most agent pilots never reach production, and evaluation gaps are the most commonly cited blocker.
A dataset needs three layers: task records, trajectory logs, and a rubric.
Pull 50 production traces to start. Prioritize routine successes and known failure cases.
Calibrate your automated judge against human labels before trusting it at scale.
Version-lock your rubric so comparisons across agent updates stay valid.
Why agents break evaluation tools designed for LLMs
Final-answer scoring works for LLMs because the entire behavior is in the output. Agents don't work that way. An agent selects a tool, defines the arguments, processes the result, and determines the next step. Any of those decisions can be wrong while the final output still looks plausible.
That architectural difference is why static benchmarks miss the failures that matter. A 2026 industry study found 88 percent of AI agent pilots fail to reach production. The leading blocker was evaluation gaps, cited by 64 percent of leaders. Teams couldn't measure what broke, so they couldn't fix it before it hit users.
Adding more test cases to your existing LLM eval suite doesn't close this gap. LLM evaluation measures text quality at a single point, whereas agent evaluation must measure decision quality across a sequence of steps. You need a different artifact entirely.
What an agent evaluation dataset contains
An agent evaluation dataset has three layers, and confusing them is the most common reason evaluation programs stall.
Task records
A task record holds the input side of one agent run: the user's intent, any context passed to the agent, and the expected outcome. The expected outcome is the correct final state of the world: a database row written, a ticket created with specific fields, a tool called with valid parameters. WorkBench calls this outcome-centric evaluation: each task has a clear outcome that agents are judged against. If the outcome is ambiguous, the task record is not ready to use.
Trajectory logs
The trajectory log is the step-by-step trace of what the agent did: which tools it called, in what order, and with what arguments. Trajectory logs differentiate agent evaluation from LLM evaluation. A response can look correct while the underlying trace reveals errors. For example, the agent might use a stale cache or call the right tool with wrong parameters. Evaluating only the final output leaves those failure modes invisible.
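To make the "right tool, wrong parameters" case concrete, here is a minimal sketch of a step-by-step trajectory comparison. The `ToolCall` class, the `diff_trajectory` helper, and the tool names are hypothetical, and strict positional matching is a simplifying assumption; real agents often have more than one valid path.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One step in a trajectory: the tool name and the arguments passed."""
    tool: str
    args: dict[str, Any] = field(default_factory=dict)


def diff_trajectory(actual: list[ToolCall], expected: list[ToolCall]) -> list[str]:
    """Compare two trajectories position by position and describe each mismatch."""
    problems = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a.tool != e.tool:
            problems.append(f"step {i}: called {a.tool}, expected {e.tool}")
        elif a.args != e.args:
            problems.append(f"step {i}: {a.tool} called with wrong parameters")
    if len(actual) != len(expected):
        problems.append(
            f"length mismatch: {len(actual)} steps, expected {len(expected)}"
        )
    return problems


# Hypothetical example: the final message would read "Email sent" either way.
expected = [
    ToolCall("lookup_contact", {"name": "Dana Lee"}),
    ToolCall("send_email", {"to": "dana.lee@example.com"}),
]
actual = [
    ToolCall("lookup_contact", {"name": "Dana Lee"}),
    ToolCall("send_email", {"to": "dan.lee@example.com"}),
]
issues = diff_trajectory(actual, expected)
print(issues)
```

The final output alone would not distinguish these two runs; only the trace shows that the second step used the wrong address.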
Ground truth set and rubric
A ground truth set consists of task records with expert scores attached. It anchors everything else. The set must contain tasks (routine and edge cases), expert scores tied to a written rubric, and version history. Version history prevents score comparisons from blurring as the agent evolves. Rubrics are mandatory. Without them, two reviewers scoring the same trace reach different conclusions for different reasons, and your dataset becomes noisy.
Safety-focused datasets carry the same layered structure with an added dimension. SHADE-Arena pairs each benign task with a harmful variant to test whether an agent takes subtle wrong actions and whether those actions can be detected. That pairing makes the evaluation dataset a tool for measuring both task performance and behavior boundaries, not just output quality.
Step 1: Pull tasks from production traces
Production logs offer the most direct route to a useful dataset. Every trace your agent generates contains the raw material for a task record. Structuring that material into scorable task records is the work.
For each trace, extract four fields:
User intent: the original request or trigger
Tool calls made: the sequence, arguments, and responses
Context retrieved: any data the agent pulled before acting
Final output: what was returned or executed
Sampling strategy matters more than volume at this stage. Pull traces that cover successes first. Routine traces define what normal looks like. Then pull traces from known failure categories: tasks where users complained, tasks that generated escalations, tasks that triggered retries. Twenty success traces and ten failure traces give you more calibration value than two hundred randomly sampled runs.
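The sampling rule above can be sketched as a small stratified sampler. The trace fields (`complained`, `escalated`, `retries`) are assumed names for whatever signals your logging exposes, and treating every trace without those signals as a routine success is a simplification.

```python
import random


def is_known_failure(trace: dict) -> bool:
    """A trace counts as a known failure if users complained, it escalated, or it retried."""
    return bool(trace["complained"] or trace["escalated"] or trace["retries"] > 0)


def sample_traces(traces, n_success=20, n_failure=10, seed=0):
    """Pick routine successes first, then traces from known failure categories."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    successes = [t for t in traces if not is_known_failure(t)]
    failures = [t for t in traces if is_known_failure(t)]
    picked = rng.sample(successes, min(n_success, len(successes)))
    picked += rng.sample(failures, min(n_failure, len(failures)))
    return picked


# Hypothetical log of 200 runs with a handful of complaints and retries.
traces = [
    {
        "id": i,
        "complained": i % 25 == 0,
        "escalated": False,
        "retries": 1 if i % 40 == 0 else 0,
    }
    for i in range(200)
]
sample = sample_traces(traces)
print(len(sample), sum(is_known_failure(t) for t in sample))
```

Fixing the seed keeps the sample stable between runs, which matters once reviewers have started scoring it.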
Evaluation must assess sequential decision-making across dynamic environments. This includes planning, tool use, and memory. Your sampling should reflect the actual distribution of decision complexity in your system. A single-step lookup task and a five-step workflow with branching logic are different evaluation problems and need separate task record types.
Once extracted, organize each trace into a task record with a unique ID and the four fields above. Add a status field you'll populate in the next step. That status field holds the human judgment.
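As an illustration, a task record with the four fields and an empty status might look like the sketch below. The input trace shape (`input`, `steps`, `output`, and the step `type` values) is an assumption; your tracing platform's export format will differ.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TaskRecord:
    task_id: str
    user_intent: str                   # the original request or trigger
    tool_calls: list[dict[str, Any]]   # sequence, arguments, and responses
    context_retrieved: list[str]       # data the agent pulled before acting
    final_output: str                  # what was returned or executed
    status: Optional[str] = None       # left empty until a human reviewer scores it


def to_task_record(trace: dict[str, Any]) -> TaskRecord:
    """Map one raw trace (hypothetical shape) onto the four fields plus a status slot."""
    steps = trace["steps"]
    return TaskRecord(
        task_id=str(uuid.uuid4()),
        user_intent=trace["input"],
        tool_calls=[
            {"tool": s["name"], "args": s["args"], "response": s["result"]}
            for s in steps
            if s["type"] == "tool"
        ],
        context_retrieved=[s["content"] for s in steps if s["type"] == "retrieval"],
        final_output=trace["output"],
    )


trace = {
    "input": "Refund order 1042",
    "steps": [
        {"type": "retrieval", "content": "Order 1042: $30, delivered"},
        {"type": "tool", "name": "issue_refund", "args": {"order_id": 1042}, "result": "ok"},
    ],
    "output": "Refund issued for order 1042.",
}
record = to_task_record(trace)
print(record.user_intent, record.status)
```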
Step 2: Build your ground truth set and scoring rubric
Skipping this step causes most agent evaluation programs to drift. Without a rubric, every reviewer is effectively running a different evaluation.
A rubric for agent evaluation has four components:
Pass/fail verdict: did the agent accomplish the user's intent through the correct sequence of actions?
Issue tags: a taxonomy of failure modes: wrong tool called, correct tool with wrong parameters, hallucinated context, skipped validation, correct output from incorrect path
Severity level: whether the failure would produce a wrong action (high), a degraded but recoverable outcome (medium), or an efficiency loss only (low)
Expected behavior note: a short description of what the correct trajectory would have looked like
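One way to keep reviewers on the same structure is to encode the four components as a typed record. The sketch below is an assumption about how you might model it, not a prescribed schema. The rule that a failing verdict must carry at least one issue tag and a severity is a design choice added here for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class IssueTag(Enum):
    WRONG_TOOL = "wrong_tool"
    WRONG_PARAMETERS = "wrong_parameters"
    HALLUCINATED_CONTEXT = "hallucinated_context"
    SKIPPED_VALIDATION = "skipped_validation"
    CORRECT_OUTPUT_INCORRECT_PATH = "correct_output_incorrect_path"


class Severity(Enum):
    HIGH = "high"      # would produce a wrong action
    MEDIUM = "medium"  # degraded but recoverable outcome
    LOW = "low"        # efficiency loss only


@dataclass
class RubricScore:
    task_id: str
    passed: bool
    issue_tags: list[IssueTag] = field(default_factory=list)
    severity: Optional[Severity] = None
    expected_behavior: str = ""

    def __post_init__(self):
        # Illustrative rule: a failure must say what went wrong and how badly.
        if not self.passed and (not self.issue_tags or self.severity is None):
            raise ValueError("a failing score needs at least one issue tag and a severity")


score = RubricScore(
    task_id="task-001",
    passed=False,
    issue_tags=[IssueTag.WRONG_PARAMETERS],
    severity=Severity.HIGH,
    expected_behavior="Look up the contact, then email the address returned by the lookup.",
)
print(score.severity.value)
```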
Scoring your ground truth set
Have domain experts score 30 to 50 task records using the rubric. Label Studio's human-in-the-loop evaluation interface provides dedicated fields for each rubric component, so every reviewer works from the same structure rather than interpreting criteria independently. A turn-level filter lets reviewers examine User, Assistant, and Tool turns in isolation. The filter helps with multi-step traces: a reviewer looking at 15 turns can jump directly to the decision point they're scoring.
The 2025 AI Agent Index found that only 4 of 30 indexed agents provided agent-specific safety evaluations, and that 135 of 240 safety-related fields had no public information available. Most teams are not doing this work. A rubric with 30 scored examples puts you ahead of most teams running agents in production.
Lock the rubric before moving on
Version your rubric the moment your experts finish their first scoring pass. Changing a rubric definition after calibration begins means scores from before and after the change can no longer be compared. If you discover a new failure mode, add it as a new issue tag under a new rubric version without altering existing definitions. The rubric is a living document, but its versioning must be deliberate.
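A lightweight way to enforce this is to store a content fingerprint of the rubric alongside every score. The sketch below assumes the rubric lives in a plain dictionary; the field names and the `only_adds_tags` helper are hypothetical.

```python
import hashlib
import json


def rubric_fingerprint(rubric: dict) -> str:
    """Hash a canonical JSON form so any change to the rubric changes the fingerprint."""
    canonical = json.dumps(rubric, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def only_adds_tags(old: dict, new: dict) -> bool:
    """True if every existing definition is untouched and only new issue tags were added."""
    tags_unchanged = all(
        new["issue_tags"].get(tag) == text for tag, text in old["issue_tags"].items()
    )
    return tags_unchanged and old["severity"] == new["severity"]


rubric_v1 = {
    "version": "1.0",
    "issue_tags": {
        "wrong_tool": "Agent called a tool that cannot accomplish the intent.",
        "wrong_parameters": "Agent called the right tool with incorrect arguments.",
    },
    "severity": {
        "high": "Would produce a wrong action.",
        "medium": "Degraded but recoverable outcome.",
        "low": "Efficiency loss only.",
    },
}

# A new failure mode is added as a new tag; existing definitions stay as they were.
rubric_v2 = {
    **rubric_v1,
    "version": "1.1",
    "issue_tags": {
        **rubric_v1["issue_tags"],
        "stale_cache": "Agent acted on cached data that was out of date.",
    },
}

print(rubric_fingerprint(rubric_v1), rubric_fingerprint(rubric_v2))
print(only_adds_tags(rubric_v1, rubric_v2))
```

Storing the fingerprint with each score lets you refuse to compare two evaluation runs whose fingerprints differ.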
Step 3: Calibrate your automated judge against human labels
You cannot run human reviewers over every production trace indefinitely. At some point an automated judge has to take over, and a ground truth set is what makes that automation trustworthy.
Run your LLM-as-judge over the same 30 to 50 records your experts scored. Compute agreement by rubric component: pass/fail first, then issue tags, then severity. High pass/fail agreement with low issue-tag agreement indicates the judge catches gross failures but misses the nuance separating recoverable errors from harmful ones.
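Here is a minimal sketch of per-component agreement, using invented labels for six records. Raw agreement is the simplest measure. Cohen's kappa is added as one common chance-corrected alternative, and mean Jaccard overlap is one possible way to compare issue-tag sets. Neither metric is prescribed by the text above.

```python
def percent_agreement(a, b):
    """Fraction of records where the two label lists match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the matches two raters would reach by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_tag_jaccard(human_tags, judge_tags):
    """Average overlap between the tag sets each rater assigned to the same record."""
    scores = []
    for h, j in zip(human_tags, judge_tags):
        union = h | j
        scores.append(len(h & j) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six scored records.
human_pass = [True, True, False, False, True, False]
judge_pass = [True, True, False, False, True, True]
human_tags = [set(), set(), {"wrong_parameters"}, {"wrong_tool"}, set(), {"skipped_validation"}]
judge_tags = [set(), set(), {"wrong_tool"}, {"wrong_tool"}, set(), set()]

pass_fail = percent_agreement(human_pass, judge_pass)
kappa = cohens_kappa(human_pass, judge_pass)
tag_overlap = mean_tag_jaccard(human_tags, judge_tags)
print(round(pass_fail, 3), round(kappa, 3), round(tag_overlap, 3))
```

In this invented sample the judge matches the experts on five of six verdicts but agrees less often on which failure occurred, which is the pattern described above.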
The Goal-oriented Performance Assessment (GPA) framework computes alignment metrics on test runs or sample traces, applied by human evaluators or automated systems. Your ground truth set supplies the benchmark: the human scores are the reference. Where the judge diverges, there are two likely causes: a rubric that is ambiguous for that task type, or a poorly specified judge prompt. Work backward from the lowest-agreement task categories to find out which.
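Working backward from the lowest-agreement categories can start with a simple ranking like the sketch below. The record fields (`category`, `human_pass`, `judge_pass`) and the sample data are hypothetical.

```python
from collections import defaultdict


def agreement_by_category(records):
    """Return (category, agreement rate) pairs, lowest agreement first."""
    totals = defaultdict(lambda: [0, 0])  # category -> [matches, count]
    for r in records:
        totals[r["category"]][0] += int(r["human_pass"] == r["judge_pass"])
        totals[r["category"]][1] += 1
    rates = {category: matches / count for category, (matches, count) in totals.items()}
    return sorted(rates.items(), key=lambda item: item[1])


records = [
    {"category": "refund", "human_pass": True, "judge_pass": True},
    {"category": "refund", "human_pass": False, "judge_pass": True},
    {"category": "refund", "human_pass": False, "judge_pass": True},
    {"category": "refund", "human_pass": True, "judge_pass": True},
    {"category": "lookup", "human_pass": True, "judge_pass": True},
    {"category": "lookup", "human_pass": True, "judge_pass": True},
    {"category": "lookup", "human_pass": False, "judge_pass": False},
]
ranked = agreement_by_category(records)
print(ranked)
```

The category at the top of the list is where to read individual traces first and decide whether the rubric or the judge prompt needs work.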
Sense Street used Label Studio Enterprise to scale this kind of structured human review operation. The platform produced a 120 percent increase in annotator efficiency and a 150 percent increase in labels, per the Sense Street case study. Human calibration at scale is operationally feasible. The assumption that it is too slow is usually a process problem, not a capacity problem.
One real limit: LLM-as-judge calibration only works if your ground truth set covers the failure modes your agent actually encounters. For a newly launched feature or a task type with no production history yet, you have no traces to draw from. Synthetic task generation is the only option in that situation, but synthetic data carries its own distribution gaps that won't surface until real users arrive. Build your ground truth set from real traces as soon as they exist.
Keeping the dataset current as your agent changes
A dataset built once decays. Agents change when prompts change, when tools are added or deprecated, when underlying model versions update. A task record built for your agent's June behavior may no longer represent valid expected trajectories in September.
Sample new traces on a cadence. Don't treat building the dataset as a launch activity. Pull a fresh sample after you change the agent's tools or reasoning flow. A change in tool parameters is enough to invalidate expected trajectory steps.
Version-lock the rubric before each evaluation run. HumanSignal's framework for scaling evaluation identifies version history as a required component of a ground truth set. Without it, score comparisons blur across updates. If you changed the rubric between runs, you are not comparing like with like.
Use disagreement review to surface new failure modes. When your automated judge and human reviewers diverge on a task category that was previously high-agreement, that is a signal. Either a new failure mode appeared that the rubric doesn't yet describe, or the agent's behavior shifted and old task records no longer apply. Either way, the disagreement is diagnostic, not noise.
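A disagreement review can be triggered automatically by comparing per-category agreement between two evaluation runs. In this sketch the thresholds (`high=0.9`, `drop=0.15`) and the category names are arbitrary assumptions to tune for your own data.

```python
def flag_agreement_drops(previous, current, high=0.9, drop=0.15):
    """Return categories that were high-agreement before and have since fallen sharply."""
    flagged = []
    for category, prev_rate in previous.items():
        curr_rate = current.get(category)
        if curr_rate is None:
            continue  # category not present in the current run
        if prev_rate >= high and prev_rate - curr_rate >= drop:
            flagged.append(category)
    return sorted(flagged)


# Hypothetical judge-versus-human agreement rates from two runs.
previous = {"lookup": 0.95, "refund": 0.60, "scheduling": 0.92}
current = {"lookup": 0.94, "refund": 0.55, "scheduling": 0.70}
to_review = flag_agreement_drops(previous, current)
print(to_review)
```

Categories that were already low-agreement are excluded on purpose: they point to a calibration problem you already know about rather than a new shift.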
Gartner predicts 40 percent of enterprise AI agent initiatives will be cancelled before end of 2027 for lack of clear evaluation frameworks (Gartner, 2025). A versioned rubric and a maintained dataset are what "clear" looks like in practice.
The dataset is your institutional memory
The agent that looked fine in the demo but failed in production wasn't missing better infrastructure. It was missing a documented record of what correct behavior looks like at each step. Pull 50 traces from production this week. Identify 10 that represent routine success and 5 that represent known failures. Score them with a rubric. That 15-trace nucleus is the seed of your ground truth set, which you can grow toward 30 to 50 records, and the anchor for every automated judge, calibration run, and disagreement review that follows. Once it exists, you stop guessing at what broke and start measuring it. The ground truth set guide and the human-in-the-loop trace evaluation walkthrough have the implementation detail when you're ready.
How many tasks do I need in a ground truth set?
A reliable ground truth set starts with 30 to 50 high-quality, human-verified tasks. While synthetic datasets can reach thousands of entries, practitioners find that 20 to 30 real production logs are more valuable for catching the long tail of user behavior than large volumes of generated data.
When does LLM-as-judge fail for agent evaluation?
Automated judges struggle with tasks requiring subjective taste or specific domain expertise where "correctness" is not explicitly stated in the prompt. Because these judges can hallucinate their own reasoning, they are prone to providing overly positive feedback unless calibrated against a human-scored ground truth set.
How often should I refresh my evaluation dataset?
You must update your dataset whenever you change the agent's tools, reasoning prompts, or underlying model versions. Because agent behavior is a moving target, intermediate steps like tool selection or argument formatting can shift even if the final goal remains the same, making old trajectory logs obsolete.
How do I evaluate agents that have side effects like database writes?
Use outcome-centric evaluation to judge the agent by the final state of the environment rather than its text output. The WorkBench project recommends using sandbox environments where you can verify if the correct database row was written or the specific API call was executed with valid parameters.
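As a rough illustration of judging by final state, the sketch below uses an in-memory SQLite database as the sandbox. The `tickets` schema and the `fake_agent_run` stand-in are hypothetical and are not taken from WorkBench.

```python
import sqlite3


def make_sandbox() -> sqlite3.Connection:
    """Create a throwaway in-memory database for one evaluation run."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tickets (id INTEGER PRIMARY KEY, customer TEXT, priority TEXT, status TEXT)"
    )
    return conn


def fake_agent_run(conn: sqlite3.Connection) -> str:
    """Stand-in for the agent: performs a side effect and returns a text reply."""
    conn.execute(
        "INSERT INTO tickets (customer, priority, status) VALUES (?, ?, ?)",
        ("acme", "high", "open"),
    )
    conn.commit()
    return "I created the ticket."


def outcome_matches(conn: sqlite3.Connection, expected: dict) -> bool:
    """Pass only if exactly one ticket exists for the customer with the expected fields."""
    rows = conn.execute(
        "SELECT customer, priority, status FROM tickets WHERE customer = ?",
        (expected["customer"],),
    ).fetchall()
    return rows == [(expected["customer"], expected["priority"], expected["status"])]


sandbox = make_sandbox()
reply = fake_agent_run(sandbox)
ok = outcome_matches(sandbox, {"customer": "acme", "priority": "high", "status": "open"})
print(reply, ok)
```

The check ignores the agent's reply entirely: a confident "I created the ticket" still fails if the row is missing, duplicated, or has the wrong fields.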
What data should flow from my tracing platform into a labeling environment?
Your tracing platform should export the full trajectory: the original user intent, the sequence of tool calls with their arguments, the retrieved context, and the final output. In Label Studio, you can then filter these turns to allow experts to score specific decision points within the multi-step trace.