How to evaluate agent planning
An agent returns the correct final answer. The observability dashboard shows no errors. But the agent took 11 steps to do a 3-step job. It called the same tool twice with conflicting inputs, and recovered only because a retry happened to work. The output was right. The plan was broken. Output evaluation caught nothing.
That scenario is more common than it should be. Most teams build evaluation around the agent's final output instead of its decision-making process. When the plan is what fails, output metrics give you no signal at all.
TL;DR
Output evaluation and plan evaluation are separate jobs.
Agent planning fails across four dimensions: decomposition, path selection, tool use, and reflection.
A rubric that scores all four catches failures output metrics hide.
Human trace review surfaces planning failures that automated checks miss.
The evaluation loop closes when findings feed back into prompts or ground truth.
Why output evaluation misses the planning layer
Workflows follow predefined code paths. Agents use an LLM to direct their own processes and tool usage dynamically, choosing their sequence of steps at runtime (Anthropic). That autonomy is the point, but it's also what creates the evaluation problem.
When an agent picks its own path, the route to the answer isn't fixed. Two agents can return identical outputs from completely different plans, one sound and one fragile. Evaluating only the output treats both as equally good. They're not.
Companies are deploying agents faster than they can make them work reliably. A 2025 PwC survey found that 79 percent of companies have adopted AI agents. Yet 68 percent report that half or fewer of their employees interact with agents in everyday work. Agents are in production. Reliable planning is not.
A rubric for the four planning failure modes
Research on agent planning groups approaches into task decomposition, plan selection, external module use, reflection, and memory (arXiv:2402.02716). This rubric folds reflection and memory into one dimension, which gives four. Each one fails in a different way. Score all four on every trace you review, not only on traces that produced a wrong answer.
Task decomposition
If an agent routes three subtasks through one tool call when they require separate ones, it's failed at decomposition, even if the final answer looks right. Look for steps that are too coarse to act on, subtasks with shared dependencies sequenced in the wrong order, or steps that are skipped entirely.
When reviewing a trace, check: did the agent identify the right number of steps? Were any steps duplicated? The output doesn't change that verdict.
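The duplicate and skipped-step checks can be partly automated before a reviewer looks at the trace. This is a minimal sketch, assuming a trace can be reduced to an ordered list of step names and that you maintain a list of required steps per task. Both the step names and the function are hypothetical.

```python
from collections import Counter


def decomposition_findings(step_names, required_steps):
    """Return duplicated and missing steps for one trace.

    step_names: ordered step labels taken from the trace (assumed format).
    required_steps: labels a reviewer expects for this task (assumed to exist).
    """
    counts = Counter(step_names)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    missing = sorted(set(required_steps) - set(step_names))
    return {"duplicated": duplicated, "missing": missing}


# Hypothetical refund-handling trace with one repeated and one skipped step.
trace = ["fetch_order", "fetch_order", "compute_refund", "send_email"]
required = ["fetch_order", "check_policy", "compute_refund", "send_email"]
findings = decomposition_findings(trace, required)
```

A non-empty result is a prompt for review, not a verdict. Whether a step is too coarse or sequenced wrongly still needs a human read.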
Plan selection
The scoring question is simple: given what the agent knew at step one, was the chosen path the most direct route to the goal? If it required two recoveries to work, the selection was wrong even if the recoveries succeeded.
Failure here isn't always obvious. The agent may pick a path that worked on a simpler version of the problem, or fall back on a previously successful route when the context has changed. Both produce an answer. Neither reflects sound planning.
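One way to make the "two recoveries" signal countable is sketched below. It assumes each step records the tool it called and whether the call succeeded, and it treats a failed call followed later by a successful call to the same tool as one recovery. That definition is an assumption for illustration, not a standard.

```python
def count_recoveries(steps):
    """Count failed steps that a later successful call to the same tool recovered."""
    recoveries = 0
    for i, step in enumerate(steps):
        if step["ok"]:
            continue
        later_steps = steps[i + 1:]
        if any(s["tool"] == step["tool"] and s["ok"] for s in later_steps):
            recoveries += 1
    return recoveries


# Hypothetical trace: two failed calls, each followed by a successful retry.
steps = [
    {"tool": "search", "ok": False},
    {"tool": "search", "ok": True},
    {"tool": "book", "ok": False},
    {"tool": "book", "ok": True},
    {"tool": "confirm", "ok": True},
]
recoveries = count_recoveries(steps)
selection_suspect = recoveries >= 2  # threshold taken from the scoring question above
```

The count only flags a trace. Deciding whether a more direct path existed at step one remains a reviewer's call.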
External module use
Tool calls and retriever queries are where plans break at the execution layer. An agent can decompose correctly and select the right path, then issue a malformed tool call that forces a retry. Failure here is measurable: duplicate calls to the same tool, calls with conflicting parameters, or retriever queries that return results the agent then ignores.
For each tool call in the trace, check three things: was the tool right for that step, were the inputs well-formed, and did the agent use the returned output.
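Two of these signals, repeated calls with differing inputs and outputs the agent never used, can be pulled out of a trace mechanically. The sketch below assumes each tool call is a dict with `tool`, `inputs`, and an `output_used` flag. That shape is hypothetical, and your tracing tool will have its own.

```python
from collections import defaultdict


def conflicting_calls(tool_calls):
    """Report tools that were called more than once with differing inputs."""
    inputs_by_tool = defaultdict(list)
    for call in tool_calls:
        inputs_by_tool[call["tool"]].append(call["inputs"])

    conflicts = {}
    for tool, inputs in inputs_by_tool.items():
        # Keep the first occurrence of each distinct input set, in call order.
        distinct = [x for n, x in enumerate(inputs) if x not in inputs[:n]]
        if len(distinct) > 1:
            conflicts[tool] = distinct
    return conflicts


# Hypothetical calls: the flight lookup ran twice with different dates.
calls = [
    {"tool": "get_flights", "inputs": {"date": "2025-03-01"}, "output_used": False},
    {"tool": "get_flights", "inputs": {"date": "2025-03-02"}, "output_used": True},
    {"tool": "get_weather", "inputs": {"city": "Oslo"}, "output_used": True},
]
conflicts = conflicting_calls(calls)
ignored = [c["tool"] for c in calls if not c["output_used"]]
```

Differing inputs are not always a conflict, since some tools are legitimately called several times. Treat the result as a list of candidates for the three-question check.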
Reflection and memory
In orchestrator-worker architectures, a central LLM breaks down tasks and delegates them to worker LLMs, then synthesizes results. The orchestrator's reflection step is what prevents errors from compounding. Failure here means the agent kept executing its original plan after a step returned an anomalous result, without reassessing.
Score reflection by checking whether the agent updated its plan when a step returned unexpected output. If a tool call returned an error code and the agent proceeded as if it succeeded, reflection failed.
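As a rough sketch of that check, assume each step carries a `status` and a `plan_revised` flag that is true when the agent changed its plan before executing that step. Both field names are hypothetical. You would need to derive them from whatever your traces record.

```python
def reflection_failures(steps):
    """Return indices of error steps that were not followed by a plan update."""
    failures = []
    # The last step has no successor, so it is not checked here.
    for i, step in enumerate(steps[:-1]):
        next_step = steps[i + 1]
        if step["status"] == "error" and not next_step["plan_revised"]:
            failures.append(i)
    return failures


# Hypothetical trace: the first error is ignored, the second triggers a revision.
steps = [
    {"status": "ok", "plan_revised": False},
    {"status": "error", "plan_revised": False},
    {"status": "ok", "plan_revised": False},
    {"status": "error", "plan_revised": False},
    {"status": "ok", "plan_revised": True},
]
failed_at = reflection_failures(steps)
```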
Setting up human review for agent traces
Human review of agent planning works best as a structured workflow: defined inputs, per-step scoring, and verdicts that feed back into the agent. Without that structure, review becomes ad-hoc spot-checking.
Capturing traces
A trace records each step the agent took: the input, the tool or model call, and the output at each transition. Tools like Braintrust, LangSmith, and Langfuse produce this format. Reviewers score that trace artifact.
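If it helps to picture the artifact, here is a minimal, tool-agnostic sketch of that record. The class and field names are assumptions for illustration and do not mirror the export schema of any of the tools named above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceStep:
    index: int   # position of the step in the run
    input: Any   # what the agent passed into the call
    call: str    # name of the tool or model that was invoked
    output: Any  # what came back at this transition


@dataclass
class Trace:
    trace_id: str
    steps: list[TraceStep] = field(default_factory=list)


# Hypothetical two-step trace.
trace = Trace(trace_id="run-001")
trace.steps.append(TraceStep(1, {"query": "order 1234"}, "search_orders", {"found": True}))
trace.steps.append(TraceStep(2, {"order_id": 1234}, "issue_refund", {"status": "ok"}))
```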
Not every trace needs review.
Start with traces from agents that produced correct final outputs but consumed more tokens or time than expected. High token consumption on a correct answer is a reasonable first signal that the plan contained failures the agent recovered from.
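That starting filter is simple to express. The sketch below assumes each trace summary has an `output_correct` flag and a `tokens` count, and that you have set a per-task token budget. All three are hypothetical names.

```python
def review_queue(traces, token_budget):
    """Pick correct-output traces that exceeded the token budget, costliest first."""
    candidates = [
        t for t in traces
        if t["output_correct"] and t["tokens"] > token_budget
    ]
    return sorted(candidates, key=lambda t: t["tokens"], reverse=True)


# Hypothetical trace summaries.
traces = [
    {"id": "a", "output_correct": True, "tokens": 1200},
    {"id": "b", "output_correct": True, "tokens": 9800},
    {"id": "c", "output_correct": False, "tokens": 15000},
    {"id": "d", "output_correct": True, "tokens": 4100},
]
queue = review_queue(traces, token_budget=4000)
queued_ids = [t["id"] for t in queue]
```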
Applying the rubric at the step level
Teams use HumanSignal's human-in-the-loop evaluation to manage this specific trace-review workflow. Reviewers import traces from observability tools and apply per-step rubric scores. They then record an overall verdict alongside failure mode tags.
The per-step scoring is what separates planning evaluation from output evaluation. A reviewer can mark step 3 as a decomposition failure and step 7 as a tool-use failure while still marking the overall output as acceptable. Distinguishing between step and output quality surfaces the broken plan the dashboard missed.
Recording verdicts and failure modes
Programmable interfaces for agent evaluation let reviewers tag failure modes with evidence spans that identify which step, input, and output constitute the failure. "The plan failed" isn't actionable. "Step 4 issued a duplicate tool call with conflicting date parameters" is.
Set a standard verdict structure before review begins. Define four fields per trace: step correctness, failure category, severity, and overall plan quality score.
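One possible shape for those four fields is sketched below. The category names follow the rubric. The severity and quality scales, and the choice to key categories by step index, are assumptions you should adapt.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    DECOMPOSITION = "decomposition"
    PLAN_SELECTION = "plan_selection"
    TOOL_USE = "tool_use"
    REFLECTION = "reflection"


@dataclass
class TraceVerdict:
    # Field 1: step index -> whether the reviewer judged the step correct.
    step_correctness: dict[int, bool]
    # Field 2: step index -> failure category, for incorrect steps only.
    failure_categories: dict[int, FailureCategory]
    # Field 3: assumed scale, 1 (minor) to 3 (blocking).
    severity: int
    # Field 4: assumed scale, 1 (broken plan) to 5 (sound plan).
    plan_quality: int

    def __post_init__(self):
        if not 1 <= self.severity <= 3:
            raise ValueError("severity must be between 1 and 3")
        if not 1 <= self.plan_quality <= 5:
            raise ValueError("plan_quality must be between 1 and 5")
        failed = {i for i, ok in self.step_correctness.items() if not ok}
        if failed != set(self.failure_categories):
            raise ValueError("every failed step needs exactly one category")


# Hypothetical verdict: step 3 failed at decomposition, step 7 at tool use.
verdict = TraceVerdict(
    step_correctness={1: True, 2: True, 3: False, 4: True, 5: True, 6: True, 7: False},
    failure_categories={3: FailureCategory.DECOMPOSITION, 7: FailureCategory.TOOL_USE},
    severity=2,
    plan_quality=2,
)
```

Validating the structure at creation time keeps reviewers from recording a failed step without a category, which is what makes later aggregation possible.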
Balancing evaluation depth against cost and latency
Full trace review isn't always viable in production. An agent processing 10,000 requests per day can't route every trace to a human reviewer. Evaluation depth carries the same cost and latency tension as planning itself: Dynamic Speculative Planning (arXiv:2509.01920) reduced total operational costs by 30 percent and unnecessary costs by up to 60 percent by making trade-offs explicit. The same logic applies here.
For agents with hard latency requirements, synchronous human review on every trace breaks the workflow. If a plan takes 400 milliseconds and a human review gate adds 30 seconds, full review is incompatible with real-time use. For these agents, use automated structural checks at runtime. Ask three questions of every trace. Did the plan include required steps? Did tool calls return expected types? Did the agent update its state? These checks catch structural failures without adding review latency.
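The three runtime questions could be sketched as a single function. It assumes each step is a dict with `name`, `output`, `state_before`, and `state_after`, and it reads "updated its state" as "the state changed at least once during the run". Both the trace shape and that reading are assumptions.

```python
def structural_checks(steps, required_steps, expected_types):
    """Answer the three runtime questions for one trace.

    steps: list of step dicts (assumed trace shape, see lead-in).
    required_steps: step names the plan must contain.
    expected_types: step name -> Python type its output should have.
    """
    names = {s["name"] for s in steps}
    return {
        "required_steps_present": set(required_steps) <= names,
        "tool_outputs_typed": all(
            isinstance(s["output"], expected_types[s["name"]])
            for s in steps
            if s["name"] in expected_types
        ),
        "state_updated": any(s["state_before"] != s["state_after"] for s in steps),
    }


# Hypothetical trace: the balance lookup returned a string instead of a float.
steps = [
    {"name": "lookup_user", "output": {"id": 7},
     "state_before": {}, "state_after": {"user_id": 7}},
    {"name": "get_balance", "output": "N/A",
     "state_before": {"user_id": 7}, "state_after": {"user_id": 7}},
]
result = structural_checks(
    steps,
    required_steps=["lookup_user", "get_balance"],
    expected_types={"lookup_user": dict, "get_balance": float},
)
```

Because the function only inspects data already in the trace, it adds no model calls and no human wait to the request path.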
Asynchronous human sampling covers what automated checks miss. Score a weekly sample of 5 to 10 percent of traces against the rubric. This catches decomposition errors, accidental path selections, and reflection steps that ignored relevant outputs.
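Drawing the weekly sample can be as simple as the sketch below, which uses Python's standard `random` module. The function name, the default rate, and the optional seed are illustrative choices. A seed makes the draw reproducible if you need to audit which traces were selected.

```python
import random


def sample_for_review(trace_ids, rate=0.05, seed=None):
    """Draw a uniform random sample of trace IDs without replacement."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    if not trace_ids:
        return []
    # Review at least one trace, and never more than exist.
    k = min(len(trace_ids), max(1, round(len(trace_ids) * rate)))
    rng = random.Random(seed)
    return rng.sample(trace_ids, k)


# Hypothetical week of 1,000 traces, sampled at 5 percent.
week = [f"trace-{n}" for n in range(1000)]
sample = sample_for_review(week, rate=0.05, seed=42)
```

Uniform sampling is the simplest option. If some agents or task types are rare, a stratified draw would keep them from being missed, at the cost of a little more bookkeeping.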
Reserve full trace review for agents in high-stakes scenarios. High-stakes agents include those writing to production systems, driving decisions, or operating in regulated domains. In those domains, a recoverable failure often becomes a policy failure.
Turning evaluation findings into planning improvements
Review verdicts have no value unless they change something. The feedback loop connects per-step findings to the inputs that govern future planning: the system prompt, the ground truth examples, and the rubric criteria.
HumanSignal's open-source Adala framework applies an act, observe, and adapt cycle where agents accumulate memory of what worked and what failed. Each round of human review adds to that memory. A decomposition failure tagged in week one becomes a corrective example in the ground truth by week two.
When reviewers tag a failure mode with an evidence span, that span becomes a labeled training signal. A step scored as a tool-use failure with a parameter conflict shows the agent which behavior to avoid, not just that performance was low. Scaling evaluation for agent workflows means applying rubrics to reasoning trajectories so intermediate planning decisions get scored and improved, not just the final answer.
Plan quality improves when evaluation findings are routed back into the agent's inputs on a fixed cadence. Monthly prompt updates target the highest-frequency failure modes. Quarterly ground truth revisions incorporate accumulated verdicts. Together, they give the feedback loop a cadence that prevents findings from stacking up without effect.
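To decide which failure modes a monthly prompt update should target, you can count tags across the accumulated verdicts. This is a minimal sketch, assuming each verdict is a dict with a `failure_modes` list. The field name and tag strings are hypothetical.

```python
from collections import Counter


def top_failure_modes(verdicts, n=2):
    """Return the n most frequent failure-mode tags with their counts."""
    counts = Counter(tag for v in verdicts for tag in v["failure_modes"])
    return counts.most_common(n)


# Hypothetical month of review verdicts.
verdicts = [
    {"trace_id": "t1", "failure_modes": ["tool_use", "decomposition"]},
    {"trace_id": "t2", "failure_modes": ["tool_use"]},
    {"trace_id": "t3", "failure_modes": []},
    {"trace_id": "t4", "failure_modes": ["tool_use", "reflection", "decomposition"]},
]
targets = top_failure_modes(verdicts)
```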
The plan is the thing to get right
The opening scenario showed an agent that reached the right answer through a broken plan. Output evaluation returned a clean pass. After applying the rubric and routing traces to reviewers, the same broken plan gets surfaced, labeled, and fixed before the next run. The right answer stops being luck.
Step-level evaluation is what changes the reliability curve. Pick one agent in production, pull its last 10 traces, and score each one against the four dimensions: decomposition, plan selection, tool use, and reflection. The failures you find in those 10 traces are the ones your output metrics have been missing. Route every finding back into the agent's prompts or ground truth.
How does agent planning evaluation differ from standard output evaluation?
Output evaluation only checks if the final answer is correct, which can hide inefficient or fragile reasoning. Planning evaluation uses a rubric to score the intermediate steps, such as task decomposition and tool selection, ensuring the agent reached the result through a sound process rather than a lucky recovery.
What observability tools produce traces for step-level review?
Standard observability platforms like Braintrust, LangSmith, and Langfuse capture the full execution trace, including every model call and tool input. These trace artifacts can be imported into HumanSignal to allow reviewers to apply per-step rubric scores and identify specific planning failures.
How often should I run human review sampling in production?
For agents with high request volumes, a weekly asynchronous sample of 5 to 10 percent of traces is usually sufficient to identify systemic planning drift. High-stakes agents operating in regulated domains or writing to production systems often require a more rigorous audit of every trace to prevent policy violations.
What should I do when agents produce conflicting plans for the same resource?
In multi-agent systems, conflicting plans often stem from coarse task decomposition or a lack of shared state. You can resolve these by using an orchestrator-worker pattern where a central LLM decomposes the task and assigns subtasks to workers, or by enforcing deterministic state machines that restrict tool access based on the current system state.
How do I write rubric criteria for plans with no single correct path?
Instead of scoring against a fixed sequence, write criteria based on functional constraints like "did the agent identify all required dependencies" or "was the tool call well-formed." This approach, supported by Adala's act, observe, and adapt cycle, evaluates the logic of the trajectory rather than its exact match to a template.