How to evaluate agent tool use

Your observability dashboard shows HTTP 200s and sub-150ms latency across thousands of tool calls. The infrastructure looks healthy. Your dashboard cannot show you that the agent selected the wrong tool in one-third of ambiguous cases. It also misses malformed parameter names passed to a database write endpoint, and misread responses from a payment confirmation. None of those failures triggered an alert. Observing infrastructure and verifying tool-use correctness are different checks. The first confirms the pipes are running. The second confirms whether the right water is flowing.

TL;DR

Uptime and latency metrics don't catch selection errors, schema mismatches, or misread responses.

Tool-use failures fall into four categories, each needing a different evaluation rubric.

Ground truth datasets are most effective for agents using 15+ tools or high-stakes systems.

Domain experts spot wrong tool choices that engineers reviewing JSON will miss.

Pre-screening traces with automated labels cuts expert review volume without sacrificing accuracy.

Why observability dashboards miss tool-use failures

A standard APM dashboard tracks infrastructure health: did the request complete, how fast, did the server respond? None of those signals captures whether the agent chose the right tool in the first place.

Research confirms that the central problem has shifted from isolated tool invocation to multi-tool orchestration over long trajectories with intermediate state and execution feedback. Safety, cost, and verifiability constraints layer on top of that. Each step in the chain can fail silently. A tool call that returns a well-formed 200 response can still carry hallucinated field values or route to the wrong endpoint entirely.

Gartner projects that over 40 percent of agentic AI projects may face cancellation by end of 2027 if teams cannot demonstrate clear business value or manage escalating costs and risk controls. Most of those issues surface after deploying at scale, not during development. Teams need a second evaluation layer built for tool-call correctness. A better dashboard alone will not close that gap.

The four failure modes that define your evaluation criteria

VentureBeat's analysis of production deployments makes this plain: agent evaluation means judging multi-step reasoning chains and tool selection decisions. Binary pass/fail metrics from single-turn evaluation don't cover that. Judging these chains consistently means knowing which failure type you're measuring.

Selection error

The agent picks the wrong tool from its available set. This is the most frequent failure, accounting for roughly 35 percent of all tool-use failures in production. Selection errors cluster under two conditions: when the agent has too many tools and when two tools have overlapping descriptions. Practitioners confirm these selection patterns directly. As one Hacker News contributor put it: "In my experience that's where most bad tool calls come from, not from missing descriptions but from ambiguous overlap between tools." Agents select tools less accurately once the available set exceeds 20 to 30 options. At 50 tools, wrong selections become a recurring problem.

Your evaluation rubric for this failure mode asks: given this user intent and this tool set, did the agent choose the tool a domain expert would choose?
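As a rough illustration, the selection rubric can be scored as a simple agreement rate between the agent's choice and the expert's choice. This is a minimal sketch under assumptions: the `ReviewedCall` record, its field names, and the sample tool names are hypothetical and not taken from any particular tracing format.

```python
from dataclasses import dataclass


@dataclass
class ReviewedCall:
    intent: str        # what the user asked for
    chosen_tool: str   # tool the agent actually called
    expert_tool: str   # tool a domain expert says was correct


def selection_accuracy(calls: list[ReviewedCall]) -> float:
    """Fraction of calls where the agent's choice matches the expert's."""
    if not calls:
        raise ValueError("no reviewed calls to score")
    correct = sum(c.chosen_tool == c.expert_tool for c in calls)
    return correct / len(calls)


# Hypothetical reviewed traces; the second one is a selection error.
reviewed = [
    ReviewedCall("refund order 1182", "issue_refund", "issue_refund"),
    ReviewedCall("where is my order?", "issue_refund", "track_shipment"),
    ReviewedCall("cancel my subscription", "cancel_plan", "cancel_plan"),
    ReviewedCall("update my card", "update_payment", "update_payment"),
]
accuracy = selection_accuracy(reviewed)
print(f"selection accuracy: {accuracy:.2f}")
```

The expert label is the part that cannot be automated here: the metric is only as good as the `expert_tool` column.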

Schema error

The agent selects the correct tool but constructs the call incorrectly: wrong parameter names, missing required fields, incorrect data types. Schema errors account for roughly 28 percent of failures. They often pass validation if the schema is loosely defined, which means they reach production undetected. Your rubric here asks: does the generated call match the documented schema, and do the values reflect what the user requested?
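The structural half of that rubric can be checked mechanically. The sketch below assumes a hand-written schema format (a dict of field name to expected Python type and a required flag); the schema, the field names, and `check_call` are illustrative, not a real validation library. Whether the values reflect what the user requested still needs a reviewer.

```python
def check_call(arguments: dict, schema: dict) -> list[str]:
    """Return a list of schema problems; an empty list means well-formed."""
    problems = []
    for name, spec in schema.items():
        if name not in arguments:
            if spec["required"]:
                problems.append(f"missing required field: {name}")
            continue
        if not isinstance(arguments[name], spec["type"]):
            problems.append(f"wrong type for {name}")
    # A strict schema also rejects parameters it does not define.
    for name in arguments:
        if name not in schema:
            problems.append(f"unknown parameter: {name}")
    return problems


# Hypothetical schema for a database write tool.
WRITE_SCHEMA = {
    "customer_id": {"type": str, "required": True},
    "amount_cents": {"type": int, "required": True},
    "note": {"type": str, "required": False},
}

# The agent used camelCase for one name and a string for a number.
call = {"customerId": "c_42", "amount_cents": "1999"}
issues = check_call(call, WRITE_SCHEMA)
for issue in issues:
    print(issue)
```

Rejecting unknown parameters is the design choice that matters: a loosely defined schema would silently accept `customerId` and let the call reach production.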

Execution error

The tool call is correctly formed but the external system returns an error the agent mishandles. Mishandled responses include rate limits, authentication failures, and timeout responses the agent interprets as success. Execution errors make up approximately 22 percent of failures. The rubric asks: did the agent correctly interpret the tool's response, and did it take the appropriate next action given that response?
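One way to encode the execution rubric is a table of acceptable next actions per error response. This is a sketch under assumptions: the action names and the mapping are hypothetical, and a real policy would come from your own runbooks.

```python
# Hypothetical policy: which next actions are acceptable for each error status.
ACCEPTABLE_NEXT_ACTIONS = {
    429: {"retry_with_backoff", "escalate"},   # rate limit
    401: {"reauthenticate", "escalate"},       # authentication failure
    504: {"retry_with_backoff", "escalate"},   # gateway timeout
}


def execution_verdict(status_code: int, next_action: str) -> str:
    """Label a turn 'execution_error' if the agent mishandled an error response."""
    if 200 <= status_code < 300:
        # A successful response cannot be mishandled as an error here;
        # misreading its contents is a parsing question, checked separately.
        return "pass"
    allowed = ACCEPTABLE_NEXT_ACTIONS.get(status_code, {"escalate"})
    return "pass" if next_action in allowed else "execution_error"


# The agent treated a rate-limit response as success and moved on.
print(execution_verdict(429, "continue_as_success"))
print(execution_verdict(429, "retry_with_backoff"))
```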

Parsing error

The agent receives a valid response but pulls the wrong information from it, routing the wrong value into the next step. At roughly 15 percent of failures, parsing errors are the hardest to catch because the trace looks correct at every prior step. Gartner research quantifies the gap: 84 percent of IT leaders say additional technical controls are needed to manage and secure AI agents. Parsing failures represent the silent risk those technical controls need to address.

Your rubric for parsing errors asks: did the value pulled from this response match what the response contained?
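For JSON responses, that comparison can be sketched as a lookup against the raw tool output. The dotted-path convention, the helper names, and the payment payload below are assumptions made for illustration.

```python
import json


def lookup(payload: dict, path: str):
    """Follow a dotted path such as 'payment.status' through nested dicts."""
    value = payload
    for key in path.split("."):
        value = value[key]
    return value


def parsed_correctly(raw_response: str, path: str, value_used) -> bool:
    """True if the value the agent carried forward is what the response held."""
    try:
        return lookup(json.loads(raw_response), path) == value_used
    except (KeyError, TypeError):
        return False  # the path does not exist in the response


# Hypothetical payment response: the status is "pending", not "confirmed".
response = '{"payment": {"status": "pending", "amount": 1999}}'
print(parsed_correctly(response, "payment.status", "confirmed"))
print(parsed_correctly(response, "payment.status", "pending"))
```

A check like this only works when a reviewer or the trace itself records which field the agent was supposed to read, which is why parsing errors usually need labeled traces.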

How to build a ground truth dataset for tool calls

Evaluation is only as reliable as the ground truth you use for comparison.

Collect traces from high-ambiguity scenarios

Start by pulling traces from staging runs and shadow production traffic. Focus on the scenarios where your agent has the most room to fail. That means inputs that could plausibly trigger multiple tools, edge cases near schema boundaries, and calls that touch high-stakes systems like financial writes or external data mutations.

Tool selection accuracy drops once the available set exceeds 20 to 30 tools, so your dataset should include examples from the zones where similar tools overlap. A dataset built only from easy cases, where one tool is obviously correct, will overestimate real-world accuracy.
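To find those overlap zones, one crude starting point is lexical similarity between tool descriptions. The sketch below uses Jaccard similarity on word sets; the 0.5 threshold and the tool names are assumptions, and word overlap is only a rough proxy for the ambiguity a model actually experiences.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercased word set of a description."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared words divided by all words across both descriptions."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def overlapping_pairs(descriptions: dict[str, str], threshold: float = 0.5):
    """Tool pairs whose descriptions share at least `threshold` of their words."""
    pairs = []
    for (n1, d1), (n2, d2) in combinations(descriptions.items(), 2):
        score = jaccard(d1, d2)
        if score >= threshold:
            pairs.append((n1, n2, round(score, 2)))
    return pairs


# Hypothetical tool set with one near-duplicate pair.
TOOLS = {
    "search_orders": "Search customer orders by date or status",
    "find_orders": "Find customer orders by status or date range",
    "send_email": "Send an email message to a recipient",
}
flagged = overlapping_pairs(TOOLS)
print(flagged)
```

Inputs that could plausibly route to either tool in a flagged pair are good candidates for the dataset.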

Abstract syntax tree matching does not reliably predict real-world execution success. Ground truth requires human judgment on actual execution outcomes, not structural similarity between a generated call and a reference call. HumanSignal's evaluation framework makes this point directly: outcome-based rubrics capture what accuracy metrics miss when intermediate tool decisions shape the final result.

Tag each trace by failure mode

For every tool call in your dataset, tag it with the failure mode category it represents: selection, schema, execution, or parsing. Tagging by failure mode lets you track each category separately and allocate expert review to the areas with the highest volume or highest stakes.
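A minimal sketch of that tagging scheme, assuming each failed call carries exactly one category. The call IDs and counts are made up for illustration.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    SELECTION = "selection"
    SCHEMA = "schema"
    EXECUTION = "execution"
    PARSING = "parsing"


def share_by_mode(tagged: list[tuple[str, FailureMode]]) -> dict[str, float]:
    """Share of tagged failures that falls into each category."""
    if not tagged:
        raise ValueError("no tagged calls")
    counts = Counter(mode for _, mode in tagged)
    return {mode.value: counts.get(mode, 0) / len(tagged) for mode in FailureMode}


# Hypothetical tagged failures: (call id, failure mode).
tagged_calls = [
    ("call-001", FailureMode.SELECTION),
    ("call-002", FailureMode.SCHEMA),
    ("call-003", FailureMode.SELECTION),
    ("call-004", FailureMode.EXECUTION),
    ("call-005", FailureMode.PARSING),
]
shares = share_by_mode(tagged_calls)
print(shares)
```

Tracking the shares per evaluation cycle shows whether a fix moved the category it targeted.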

When to skip this step

With fewer than five tools and fully disjoint schemas, a labeled evaluation dataset may cost more time than the failures it would catch. Manual review of flagged traces is faster at that scale. The ground-truth approach pays off past roughly 15 tools, or when tool calls touch high-stakes systems: financial writes, external APIs, or data mutation. Below that threshold, build the dataset only if you have observed recurring failures.

How to review tool-calling traces with domain experts

Who reviews the trace matters most. An engineer can verify that the JSON is valid. A domain expert can tell you whether the agent selected the right tool for the job. Those are different evaluation goals.

A legal expert reviewing a contract-analysis agent will spot a case where the agent called a clause-extraction tool when it should have called a precedent-search tool. The JSON in both calls could be perfectly formed. The selection error is only visible to someone who understands what the user was trying to accomplish.

Set up the trace import

Pull traces from your observability stack directly into your evaluation environment. Teams can import traces into Label Studio via its LangSmith integration as well as connections to Langfuse and Braintrust. Traces move from your monitoring stack into your evaluation environment without manual export steps.

Filter by role and assign labels

Once traces are imported, filter turns by role: User, Assistant, or Tool. Filtering to Tool turns means experts go directly to the calls that need judgment, skipping the full conversation history.

For each tool-call turn, reviewers see the full reasoning chain before and after the call. The reasoning chain context makes expert evaluation possible. Reviewers judge whether the call made sense given what the user asked. JSON validity alone doesn't answer that.
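The idea of filtering to Tool turns while keeping their context can be sketched in plain Python. The trace shape below (a list of dicts with `role` and `content`) is an assumption for illustration and is not Label Studio's import format or API.

```python
def tool_turns_with_context(trace: list[dict], window: int = 1) -> list[dict]:
    """Select Tool turns and keep `window` turns on either side for the reviewer."""
    selected = []
    for i, turn in enumerate(trace):
        if turn["role"] != "Tool":
            continue
        selected.append({
            "turn": turn,
            "before": trace[max(0, i - window):i],
            "after": trace[i + 1:i + 1 + window],
        })
    return selected


# Hypothetical trace: the assistant misreports a pending payment.
trace = [
    {"role": "User", "content": "Did my payment go through?"},
    {"role": "Assistant", "content": "Checking the payment status."},
    {"role": "Tool", "content": '{"status": "pending"}'},
    {"role": "Assistant", "content": "Your payment is confirmed."},
]
for item in tool_turns_with_context(trace):
    print(item["before"][-1]["content"])
    print(item["turn"]["content"])
    print(item["after"][0]["content"])
```

Seen in isolation, the Tool turn is valid JSON. The failure only shows up when the following Assistant turn is read next to it.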

The trace review interface provides pass/fail verdict labels, issue tags, and severity ratings for each turn. Reviewers tag each failing call with the failure mode from your rubric: selection error, schema error, execution error, or parsing error. The labeling process builds a structured record of where the agent fails and under what conditions.

VentureBeat describes this shift as agent evaluation replacing traditional data labeling as the critical path to production. Success depends on validated reasoning and tool usage, not classified outputs.

How to close the feedback loop and reduce tool-use errors

Labeled traces are only useful if they change something. Here's how to act on them.

Refine tool descriptions where selection errors cluster

Selection errors concentrate around tools with overlapping descriptions. When your labeled dataset shows a tool being chosen incorrectly in a recurring input pattern, rewrite the description to make the boundary explicit. Add examples of inputs that should route to this tool and inputs that should not. This is the fastest fix for selection errors and the one that requires no model changes.
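To see where selection errors cluster, one option is to count how often each expert-chosen tool is confused with each agent-chosen tool. This is a sketch under assumptions: the `(expert_tool, chosen_tool)` pair format and the `min_count` cutoff are hypothetical, and the tool names are invented from the contract-analysis example.

```python
from collections import Counter


def confusion_pairs(labeled: list[tuple[str, str]], min_count: int = 2):
    """Count (expert_tool, chosen_tool) mismatches, most frequent first."""
    pairs = Counter(
        (expert, chosen) for expert, chosen in labeled if expert != chosen
    )
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


# Hypothetical labels: (tool the expert wanted, tool the agent called).
labeled = [
    ("precedent_search", "clause_extraction"),
    ("precedent_search", "clause_extraction"),
    ("clause_extraction", "clause_extraction"),
    ("summarize_contract", "clause_extraction"),
    ("precedent_search", "clause_extraction"),
]
hotspots = confusion_pairs(labeled)
print(hotspots)
```

Each recurring pair points at two descriptions whose boundary needs to be rewritten together, not one description in isolation.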

Pre-screen at scale with automated labeling

Once you have a labeled ground truth set, you can tune automated prompts against it to flag high-risk turns before they reach expert reviewers. Label Studio's Prompts feature uses AI-generated labels to flag which turns are most likely to be problematic. Expert time then concentrates on the calls that need judgment, not every turn in a large trace set.

Geberit ran automated pre-labeling before human review and hit 5x labeling throughput and 95 percent annotation accuracy against ground truth, with 4-5x cost savings compared to manual review. Pre-screening didn't replace expert judgment. It reduced the volume that required it, so expert attention went to the calls that mattered.
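Before trusting a pre-screener, it helps to check it against the labeled set: how much of the volume it sends to experts, and how many known failures it would have let through. The sketch below assumes the pre-screener emits a risk score between 0 and 1 per turn; the scores, the 0.5 threshold, and the function name are hypothetical and do not represent the Prompts feature's output.

```python
def prescreen_report(items: list[tuple[float, bool]], threshold: float = 0.5) -> dict:
    """items are (risk_score, is_failure) pairs; is_failure is the expert label."""
    if not items:
        raise ValueError("no labeled items")
    flagged = [(score, failed) for score, failed in items if score >= threshold]
    failures = sum(failed for _, failed in items)
    caught = sum(failed for _, failed in flagged)
    return {
        # Share of all turns that still goes to expert review.
        "review_fraction": len(flagged) / len(items),
        # Share of known failures that the pre-screener flagged.
        "failure_recall": caught / failures if failures else 1.0,
    }


# Hypothetical scores for eight turns with expert ground truth.
scored = [
    (0.90, True), (0.80, True), (0.70, False), (0.60, True),
    (0.30, False), (0.20, False), (0.10, False), (0.05, False),
]
report = prescreen_report(scored)
print(report)
```

In this made-up sample, experts review half the turns and no known failure is missed. If recall drops as you raise the threshold, the saved review time is being paid for with missed failures.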

Build fine-tuning datasets from failure patterns

The labeled traces you've collected are training data. Selection errors in high-ambiguity zones and correctly handled edge cases provide the data a fine-tuning run needs. Examples of the right schema constructed under tricky input conditions also help. Route your labeled traces to your model improvement pipeline after each evaluation cycle.
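Turning labeled traces into training examples can be sketched as follows. The record fields (`verdict`, `agent_call`, `corrected_call`) and the output shape are assumptions for illustration; they are not any vendor's fine-tuning format, so adapt them to whatever your training pipeline expects.

```python
import json


def to_training_records(labeled_traces: list[dict]) -> list[dict]:
    """Keep passing calls as-is and failed calls only if an expert corrected them."""
    records = []
    for trace in labeled_traces:
        if trace["verdict"] == "pass":
            target = trace["agent_call"]
        elif trace.get("corrected_call") is not None:
            target = trace["corrected_call"]
        else:
            continue  # a failure with no correction has no target to learn from
        records.append({
            "input": trace["user_input"],
            "target_call": target,
            "failure_mode": trace.get("failure_mode"),
        })
    return records


# Hypothetical labeled traces from one evaluation cycle.
labeled_traces = [
    {"user_input": "refund order 1182", "verdict": "pass",
     "agent_call": {"tool": "issue_refund", "args": {"order_id": "1182"}}},
    {"user_input": "where is my order?", "verdict": "fail",
     "failure_mode": "selection",
     "agent_call": {"tool": "issue_refund", "args": {}},
     "corrected_call": {"tool": "track_shipment", "args": {}}},
    {"user_input": "update my card", "verdict": "fail",
     "failure_mode": "schema",
     "agent_call": {"tool": "update_payment", "args": {"card": 4242}}},
]
records = to_training_records(labeled_traces)
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```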

Evaluation is not a one-time audit. Each cycle of trace review produces labeled data that tightens tool descriptions, refines pre-screening prompts, and feeds the next fine-tuning run. The error rate on the failure modes you've identified goes down as a direct result.

When infrastructure health and tool-use correctness align

Once the evaluation layer is in place, HTTP 200 with sub-150ms latency tells you something it couldn't before. Selection accuracy is backed by ground truth. Schemas have been validated against your rubric. You have a labeled history of where failures concentrate and why. If your agent calls more than 15 tools or touches any system where a wrong write has real consequences, add this layer before you scale. Without it, you're confirming the pipes are running while the water flows to the wrong place.

How does agent tool use evaluation differ from standard unit testing?

Unit testing verifies that a function returns the correct output for a specific input. Agent tool use evaluation measures whether the model selects the right function for a given intent and constructs the call correctly. While unit tests confirm the tool works, evaluation confirms the agent knows how and when to use it.

What is the ideal tool count to maintain high selection accuracy?

Selection accuracy degrades significantly once an agent has more than 20 to 30 tools. At 50 tools, agents frequently fail due to ambiguous overlaps in tool descriptions. If your system requires more than 30 tools, consider a multi-agent architecture or a retrieval-augmented generation (RAG) approach to narrow the tool set provided to the model.

How do I evaluate tool calls for external APIs I do not control?

When you cannot instrument the external system, evaluate the agent based on the tool response it receives. Use a rubric to judge if the agent correctly interpreted the external data or handled errors like rate limits. In HumanSignal, reviewers can tag these as execution or parsing errors based on the trace history.

When should I escalate a tool call to a human at runtime?

Escalate calls that touch high-stakes systems, such as financial mutations or external data writes, especially when the model's confidence score is low. Runtime escalation prevents errors before they happen, while post-hoc evaluation uses those same traces to improve the model's future performance through fine-tuning and description refinement.

Is LLM-as-judge sufficient for verifying tool-use correctness?

Automated judges can catch schema errors but often miss selection and parsing errors that require domain context. Research shows that structural similarity metrics do not always correlate with real-world execution success. Use automated labeling to pre-screen large trace sets, but rely on domain experts for the final verdict on high-stakes reasoning chains.
