How to evaluate AI agents in production
Let's be honest: Manually reviewing a 50-step agent trajectory sounds unscalable. When an AI agent executes dozens of tool calls and intermediate reasoning steps, human review feels like a bottleneck. Teams naturally default to using automated large language model (LLM) judges to score these runs. But deploying an LLM-as-a-judge creates an echo chamber: the automated judge often shares the blind spots of the agent it evaluates.
TL;DR
- Automated LLM judges share the blind spots of the agents they evaluate.
- Correct final outputs often mask broken reasoning and hallucinated tool calls.
- Non-deterministic agents require soft failure thresholds in CI/CD pipelines.
- Scaling human review requires transforming nested JSON trace data into readable visual trees.
- Domain experts provide the calibration necessary to catch errors that automated judges miss.
The shift from single-turn outputs to trajectory evaluation
For years, engineering teams evaluated large language models using static text metrics like BLEU or ROUGE. They fed a single prompt into the model and measured the quality of the single response. Autonomous agents break this paradigm. Agents execute multi-step workflows, calling external APIs and adjusting their plans based on intermediate observations.
Evaluating only the final output creates a black box. An agent might successfully book a flight or resolve a customer support ticket, but the path it took matters. An agent tasked with refunding a customer might issue the refund correctly but fail to log the transaction in the compliance database. A single-turn evaluation scores this as a success because the customer received the money; a trajectory evaluation catches the missing database write.
The agent could also have hallucinated a search parameter or bypassed a security control before stumbling into the correct final state. Correct outputs can mask broken reasoning, a failure mode recently documented by Booking.com engineers. An agent that arrives at the right answer through wrong intermediate steps becomes a hidden risk: the system will fail when environmental conditions shift slightly. Glass-box evaluation opens up the trajectory. Reviewers examine the tool calls and intermediate logic to verify that the agent reached the correct destination using safe, repeatable steps.
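To make the refund example concrete, here is a minimal sketch of a glass-box trajectory check in Python. The trace schema and tool names (issue_refund, write_compliance_log) are hypothetical; the point is that the assertions inspect the path, not just the outcome.

```python
def assert_trajectory(trace: list[dict]) -> None:
    """Glass-box check: verify the agent's path, not just its final answer.

    Assumes a hypothetical schema where a trace is a list of steps,
    each a dict with a "tool" key naming the tool that was called.
    """
    tools_called = [step["tool"] for step in trace]

    # Outcome check: the customer got the refund.
    assert "issue_refund" in tools_called, "agent never issued the refund"

    # Trajectory check: the compliance write a single-turn eval would miss.
    assert "write_compliance_log" in tools_called, "refund was never logged"
    assert tools_called.index("write_compliance_log") > tools_called.index("issue_refund"), \
        "compliance log must follow the refund"
```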
Phase 1: Establish outcome-driven metrics and soft failure thresholds
Before evaluating whether intermediate steps are correct, teams need to define what a successful trajectory actually looks like, both economically and technically.
Define economic success criteria
Technical accuracy alone does not justify the cost of running autonomous workflows. Over 40 percent of agentic AI projects will be canceled by 2027 due to escalating costs, unclear business value, or missing risk controls, according to Gartner. Establish outcome-driven metrics before the agent ever reaches production.
Track the cost per resolved claim and the number of overrides per 1,000 decisions (Gartner, 2024) rather than generic benchmarks. If an agent costs $2 in token usage to automate a $1 data entry task, the deployment is financially unviable.
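A minimal sketch of those two metrics in code, with hypothetical function names and inputs; the point is that the denominators are business outcomes, not benchmark scores.

```python
def cost_per_resolved_claim(token_spend_usd: float, infra_spend_usd: float,
                            resolved_claims: int) -> float:
    """Blended cost of each claim the agent actually resolves."""
    return (token_spend_usd + infra_spend_usd) / max(resolved_claims, 1)

def overrides_per_thousand(human_overrides: int, total_decisions: int) -> float:
    """How often humans had to overrule the agent, per 1,000 decisions."""
    return human_overrides / max(total_decisions, 1) * 1000

# An agent spending $2 per $1 data entry task fails regardless of accuracy:
print(cost_per_resolved_claim(1.60, 0.40, 1))  # 2.0, vs. a $1 manual cost
```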
Implement soft failure thresholds
Traditional software testing relies on binary outcomes: a test either passes or it fails. Non-deterministic agents require a different approach to continuous integration pipelines. A third-party API timeout can cause a task failure even when the agent's reasoning is perfect.
Soft failure thresholds manage this environmental noise. For example, Monte Carlo Data configures its pipelines to escalate to a hard failure only when soft failures cross a threshold: more than 33 percent of tests soft-failing, or more than 2 soft failures in total. A percentage-based reliability metric prevents transient network issues from blocking your deployment while still catching genuine logic regressions.
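Here is one way a CI gate might encode that policy, sketched in Python. The AgentTestResult shape and the cutoff values are illustrative assumptions, not Monte Carlo Data's implementation; tune both for your own pipeline.

```python
from dataclasses import dataclass

@dataclass
class AgentTestResult:
    name: str
    passed: bool        # hard assertion on the trajectory held
    soft_failed: bool   # transient issue, e.g. a third-party API timeout

def should_block_deploy(results: list[AgentTestResult],
                        max_soft_rate: float = 0.33,
                        max_soft_count: int = 2) -> bool:
    """Hard-fail the pipeline only past the soft failure thresholds."""
    hard = [r for r in results if not r.passed and not r.soft_failed]
    if hard:
        return True  # genuine logic regressions always block
    soft = [r for r in results if r.soft_failed]
    soft_rate = len(soft) / max(len(results), 1)
    return soft_rate > max_soft_rate or len(soft) > max_soft_count
```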
Phase 2: Audit agent traces for reasoning and tool execution
Catching logic regressions requires visibility into the agent's execution, which means working with structured data.
Map the execution framework
Evaluate a trajectory by logging every agent run against the 4 distinct pillars of the ITU-T F.748.46 framework: Perception, Planning, Memory, and Action.
Did the agent correctly perceive the user's intent from the initial prompt? Did it schedule the right sequence of tasks to gather information? Did it recall the necessary context from previous conversational turns? Did it execute the API call with the correct JSON syntax? A single agent run might trigger 5 database lookups, 2 external API calls, and 3 internal reasoning loops. Segmenting the trace data isolates the point of failure. A memory retrieval error requires a different engineering fix than a tool execution error.
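A rough sketch of pillar-tagged trace logging, assuming a hypothetical span schema; segmenting steps this way is what lets you filter a 10-step run down to, say, only the Memory retrievals.

```python
from enum import Enum

class Pillar(str, Enum):
    PERCEPTION = "perception"  # intent parsing from the initial prompt
    PLANNING = "planning"      # task sequencing and reasoning loops
    MEMORY = "memory"          # context recall from previous turns
    ACTION = "action"          # tool and API execution

def log_span(trace: list[dict], pillar: Pillar, name: str, payload: dict) -> None:
    """Append one tagged step to the run's trace."""
    trace.append({"pillar": pillar.value, "name": name, "payload": payload})

trace: list[dict] = []
log_span(trace, Pillar.PERCEPTION, "parse_intent", {"goal": "gather quarterly data"})
log_span(trace, Pillar.MEMORY, "recall_context", {"turns_loaded": 3})
log_span(trace, Pillar.ACTION, "db_lookup", {"status": "ok"})

# Isolate the point of failure by pillar, not by scrolling the raw log:
memory_steps = [s for s in trace if s["pillar"] == Pillar.MEMORY.value]
```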
Visualize hierarchical trace data
Agent traces generate nested JSON payloads. Reading raw JSON logs to find a hallucinated parameter within a 40-step execution path is impractical. Subtle reasoning errors disappear when reviewers must scroll through thousands of lines of raw output. Converting these logs into a format a human reviewer can quickly scan and grade makes the difference.
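As an illustration of that transformation (a toy sketch, not any particular tool's implementation), a short recursive function can already turn a nested trace into a scannable tree; the step/children keys are an assumed schema.

```python
import json

def print_trace_tree(node: dict, depth: int = 0) -> None:
    """Render a nested agent trace as an indented, scannable tree."""
    print("  " * depth + f"{node.get('step', '?')} [{node.get('status', '-')}]")
    for child in node.get("children", []):
        print_trace_tree(child, depth + 1)

raw = """{"step": "plan_refund", "status": "ok", "children": [
  {"step": "sql_lookup", "status": "ok", "children": []},
  {"step": "issue_refund", "status": "ok", "children": []}]}"""
print_trace_tree(json.loads(raw))
# plan_refund [ok]
#   sql_lookup [ok]
#   issue_refund [ok]
```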
At production scale, Label Studio Enterprise solves this with custom React code that transforms JSON trace data into interactive tree visualizations. Reviewers can expand and collapse individual nodes representing the agent's internal thoughts, tool calls, and the resulting database queries. The visual structure lets evaluators see the prompt the LLM received and compare it directly against the raw output returned by the database. The interface renders SQL results as readable tables, allowing you to pinpoint where an agent deviated from the optimal path.
Phase 3: Calibrate automated judges with human domain experts
Break the automated echo chamber
Scaling evaluation requires automation, typically in the form of an LLM-as-a-judge. You feed the agent's transcript to another model and ask it to grade the performance. Using an LLM judge with the same architecture as the agent creates shared blind spots, an architectural overlap frequently flagged by practitioners on Hacker News. A model prone to hallucinating certain API parameters will likely approve a trace where the agent makes that same hallucination.
Align technical metrics with human reality
Relying exclusively on automated scoring skews your measurements. Technical metrics dominate 83 percent of current evaluation research, leaving human-centered and economic assessments on the periphery, according to a recent arXiv systematic review. Injecting human-in-the-loop evaluation calibrates the automated judges. Human experts grade a statistically significant sample of traces, establishing the ground truth that the LLM judge must learn to replicate. Once the experts score the traces, you feed those graded examples back into the LLM judge as few-shot examples, creating a continuous improvement cycle where the automated system learns to recognize nuanced errors it previously ignored.
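A sketch of that feedback step, assuming expert grades arrive as (trace, grade, rationale) tuples, a hypothetical format: the judge prompt is rebuilt from whatever the humans most recently corrected.

```python
def build_judge_prompt(trace: str, expert_examples: list[tuple[str, str, str]]) -> str:
    """Seed an LLM judge with expert-graded traces as few-shot examples."""
    shots = "\n\n".join(
        f"Trace: {t}\nExpert grade: {g}\nRationale: {r}"
        for t, g, r in expert_examples
    )
    return (
        "Grade the agent trace below as PASS or FAIL, applying the same "
        "standards as the expert-graded examples.\n\n"
        f"{shots}\n\nTrace: {trace}\nGrade:"
    )
```

Running the judge on a different model family than the agent further reduces the shared blind spots described above.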
Deploy targeted domain expertise
The case for generic human review breaks down when evaluating specialized workflows. If you use crowdsourced labelers to evaluate medical or financial agents, you replace an LLM's hallucination with a layperson's guess. Your reviewers must possess the context the agent operates within. A generalist cannot determine if a SQL query pulled the correct financial compliance data.
In a recent healthcare GenAI evaluation, the Mind Moves team used Label Studio to coordinate 32 National Institutes of Health (NIH) subject matter experts. These domain experts proved more conservative than the GPT-4 judges running in parallel, catching subtle evidence support errors and clinical logic gaps the automated system missed. Human reviewers accepted only 50 percent of the early-stage system's outputs. Real domain experts provide the calibration needed to secure the deployment of high-stakes agentic workflows, forcing your models to meet professional standards.
Phase 4: Measure operational ROI and continuous drift
Evaluation is not a one-time hurdle you clear before launch. APIs deprecate endpoints. User phrasing shifts over time. The language models powering your agents receive hidden weight updates that change their reasoning patterns. Tracking the ongoing financial viability of your agent against the cost of maintaining it requires measuring the operational efficiency of your evaluation loop itself.
If your human reviewers spend 20 minutes struggling with a slow interface for every trace they grade, the evaluation process will make the project financially unviable. Tooling that accelerates the review cycle without sacrificing accuracy keeps the loop sustainable. Financial technology firm Sense Street implemented Label Studio Enterprise to evaluate trader models across 5 languages and multiple asset classes. Optimizing the interface increased annotations per labeler by 120 percent. That efficiency gain allowed the company to expand its team size by 400 percent without breaking unit economics.
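The back-of-the-envelope math shows why interface speed is a unit economics problem; the figures below are hypothetical.

```python
def review_cost_per_trace(minutes_per_trace: float, hourly_rate_usd: float) -> float:
    """Cost of one human-graded trace, the unit economics of the loop."""
    return round(minutes_per_trace / 60 * hourly_rate_usd, 2)

print(review_cost_per_trace(20, 60))  # $20.00 with a slow interface
print(review_cost_per_trace(9, 60))   # $9.00 after a ~120% throughput gain
```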
Improving the human review layer sustains the continuous evaluation required to catch model drift, allowing you to build custom benchmarks for enterprise AI that evolve alongside your product.
Securing autonomous workflows
Manually reviewing multi-step agent trajectories becomes scalable when you pair hierarchical trace visualization with targeted domain expertise. Once you establish this evaluation loop, you stop guessing whether an automated judge missed a hallucinated API call. You gain a verifiable system where human experts continuously calibrate your automated metrics. A verifiable evaluation loop ensures your agents execute safe, repeatable workflows in production.