How do you define success for an AI agent?
89 percent of teams building AI agents have observability in place: logs, dashboards, and completion rates. Only 52 percent have adopted formal evaluation frameworks, according to LangChain's State of Agent Engineering report. That is a definitional gap. Most teams watch their agents run. Few have a standard for deciding whether those agents run well. Monitoring confirms that a process ran. Whether the logic behind the result was sound is a different question.
TL;DR
Monitoring confirms an agent ran; evaluation determines whether it ran well.
1 in 7 log-confirmed completions fail the user (Salesforce, 2026).
Output accuracy doesn't detect failures in multi-step agent paths.
Trajectory evaluation scores the steps an agent took, not just its final answer.
Single-step agents don't need trajectory review; multi-step ones do.
Why most agent success metrics measure the wrong thing
Task completion rate and output accuracy become the default success metrics because observability tools surface them automatically. They are easy to collect, easy to report, and easy to green-light in a review meeting.
What they confirm is that the agent produced an output. They say nothing about whether the output was reached soundly.
The gap between monitoring and evaluation matters at scale. LangChain's 2026 State of Agent Engineering report found 57 percent of organizations have agents in production. Large enterprises lead at 67 percent. Gartner projects 40 percent of those projects will fail or be canceled by 2027, citing agent-washing and skipped process mapping as the primary causes (Gartner, AI Agent Layer, 2026).
Standard metrics can't close the gap between "deployed" and "succeeding" because they were never built to detect it.
How output metrics produce phantom successes
The terminal answer problem
An agent can reach a correct final answer through a broken path. It might call the wrong tool, retrieve the wrong document, or make an invalid inference that cancels out by the last step. The output looks right. The reasoning that produced it will not hold for the next query.
Salesforce documented this in their 2026 quality analysis: 1 in 7 interactions that appear successful in system logs fail the user. They call these "phantom successes." A system reporting 92 percent accuracy and 89 percent completion can still mask a decline in usage when its metrics track output appearance instead of user outcome.
The task-length problem
For single-step agents, output accuracy often suffices. A classifier that routes a support ticket to the correct queue either routes it correctly or it doesn't. A tool call that fetches a static record either returns the right record or it doesn't. When the agent's entire operation is one step, evaluating the output evaluates the path.
Evaluation requirements change the moment agents run multi-step workflows. Toby Ord's research (arXiv:2505.05115) describes a "half-life" for AI agent success. As tasks grow longer and involve more subtasks, each additional step compounds the probability that a single failure ruins the whole task. If steps fail independently, a six-step agent with 90 percent per-step accuracy has roughly a 53 percent chance of completing cleanly. A twelve-step agent at the same accuracy succeeds roughly 28 percent of the time.
Output accuracy at the terminal step never detects this. It records whether the last step looked right, blind to the five or eleven steps before it.
Step-by-step evaluation earns its cost in exactly this scenario. Teams running simple deterministic agents can skip it. Teams running agents with branching decision paths, chained tool calls, or conditional logic cannot.
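The compounding behind the six-step and twelve-step figures can be checked in a few lines. This is a minimal sketch that assumes every step succeeds independently with the same probability, which is a simplification; the function name is illustrative.

```python
def clean_completion_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


for steps in (1, 6, 12):
    p = clean_completion_probability(0.9, steps)
    print(f"{steps:>2} steps: {p:.1%}")
#  1 steps: 90.0%
#  6 steps: 53.1%
# 12 steps: 28.2%
```

Real agents rarely have uniform per-step accuracy, so treat the numbers as an intuition for how quickly reliability decays, not as a forecast.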
The monitoring trap
89 percent of teams have observability. 52 percent have formal evaluation. Teams in that 37-point gap may be seeing confirmation that agents ran without visibility into the reasoning steps. In multi-step deployments, this lack of visibility can make it difficult to identify where a process breaks down. An agent that takes broken paths to correct answers will degrade unpredictably as the prompt distribution shifts, and no dashboard will surface the pattern before it does.
What a trajectory-based definition of success looks like
Trajectory evaluation means reviewing the sequence of steps an agent took, not just the final answer it returned. That review covers tool calls, intermediate outputs, decision branches, and handoff points between agent and system.
The definition itself is: the agent succeeded if its trajectory was sound enough to run again under similar conditions.
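To make the difference concrete, here is a small sketch of a trajectory as data. The `Step` and `Trajectory` types, the tool names, and the allowed-tool check are all hypothetical; the check is a deliberately narrow stand-in for a fuller soundness review.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str    # "tool_call", "decision", or "handoff"
    name: str    # tool or branch name recorded in the trace
    output: str  # intermediate output at this step


@dataclass
class Trajectory:
    steps: list[Step]
    final_answer: str


def output_correct(trajectory: Trajectory, expected_answer: str) -> bool:
    """Output-only check: looks at the final answer and nothing else."""
    return trajectory.final_answer == expected_answer


def unexpected_tool_calls(trajectory: Trajectory, allowed_tools: set[str]) -> list[str]:
    """Path check: names of tool calls that fall outside the allowed set."""
    return [
        step.name
        for step in trajectory.steps
        if step.kind == "tool_call" and step.name not in allowed_tools
    ]


# Hypothetical refund trace: right answer, wrong tool.
trace = Trajectory(
    steps=[
        Step("tool_call", "search_faq", "Refunds are allowed within 30 days"),
        Step("decision", "approve_refund", "approved"),
    ],
    final_answer="Refund approved",
)
ALLOWED_TOOLS = {"lookup_order", "check_refund_policy"}

print(output_correct(trace, "Refund approved"))     # True
print(unexpected_tool_calls(trace, ALLOWED_TOOLS))  # ['search_faq']
```

The output check passes while the path check flags a step, which is the case a completion dashboard cannot show.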
Rubrics
Teams use rubrics to define what a sound step looks like for a specific agent and context. For a research agent, a sound retrieval step involves querying the most relevant source before summarizing. For a customer service agent, a sound escalation step involves confirming the issue category before routing. Rubrics are not universal. They are written for the agent and the task, and they give reviewers a consistent standard to apply.
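One way to write such a rubric down is as a set of named checks over a trace. The sketch below assumes a trace is simply an ordered list of step names; the step names and the `happens_before` helper are hypothetical, and a real rubric would usually cover more than ordering.

```python
from typing import Callable

Trace = list[str]  # ordered step names from one agent run
Check = Callable[[Trace], bool]


def happens_before(first: str, second: str) -> Check:
    """Build a check that passes when `first` occurs before `second`."""
    def check(trace: Trace) -> bool:
        if second not in trace:
            return True  # the criterion does not apply to this trace
        return first in trace[: trace.index(second)]
    return check


# Rubrics are written per agent and per task.
SUPPORT_RUBRIC: dict[str, Check] = {
    "confirm category before routing": happens_before(
        "confirm_issue_category", "route_ticket"
    ),
}
RESEARCH_RUBRIC: dict[str, Check] = {
    "query source before summarizing": happens_before(
        "query_primary_source", "summarize"
    ),
}


def apply_rubric(rubric: dict[str, Check], trace: Trace) -> dict[str, bool]:
    """Return a pass/fail result for each named criterion."""
    return {name: check(trace) for name, check in rubric.items()}


print(apply_rubric(SUPPORT_RUBRIC, ["greet", "route_ticket"]))
# {'confirm category before routing': False}
print(apply_rubric(SUPPORT_RUBRIC, ["confirm_issue_category", "route_ticket"]))
# {'confirm category before routing': True}
```

Keeping each criterion named means a reviewer and an automated scorer are judging the same thing.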
Ground truth sets
A ground truth set anchors the rubric. It is a collection of traced trajectories that have already been reviewed and labeled as sound or unsound. New traces get scored against this baseline. Without an anchor, rubric scores drift over time as reviewers calibrate independently.
Calibrated automated scoring
Human review at volume doesn't scale for most teams. An LLM-as-judge handles the throughput, but only after you've validated its scores against human-reviewed ground truth. Without that validation step, you can't tell whether the automated scores reflect your rubric or just pattern-match the surface of a correct-looking step. When the gap between automated and human scores widens, that is the cue to recalibrate the judge.
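A simple way to run that validation is to score the judge against the human labels in the ground truth set. The sketch below uses raw agreement plus Cohen's kappa, a standard chance-corrected agreement statistic that is a choice made here rather than something the sources prescribe. The labels are made up for illustration.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of steps where the judge's label matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels for ten reviewed steps.
human_labels = ["sound"] * 6 + ["unsound"] * 4
judge_labels = ["sound"] * 5 + ["unsound"] * 4 + ["sound"]

print(f"agreement: {agreement_rate(human_labels, judge_labels):.0%}")  # agreement: 80%
print(f"kappa: {cohens_kappa(human_labels, judge_labels):.2f}")        # kappa: 0.58
```

What counts as close enough is a team decision; the point is to have a number to watch before trusting the judge at volume.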
Disagreement review
When automated scores and human raters diverge on the same step type, that divergence is signal. Disagreement review covers examining those cases, identifying which judgment was correct, and updating the rubric or calibration accordingly. The mechanics of scaling evaluation for agent workflows describe this loop in detail.
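As a rough sketch of the first part of that review, the snippet below groups disagreements by step type so reviewers can see where the judge and the humans part ways most often. The record fields (`trace_id`, `step_type`, `human`, `judge`) are assumptions about how scores might be stored.

```python
from collections import defaultdict


def disagreements_by_step_type(records: list[dict]) -> dict[str, list[str]]:
    """Map each step type to the trace IDs where human and judge labels differ.

    Step types with the most disagreements come first.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for record in records:
        if record["human"] != record["judge"]:
            grouped[record["step_type"]].append(record["trace_id"])
    return dict(sorted(grouped.items(), key=lambda item: len(item[1]), reverse=True))


# Illustrative scored steps.
scored_steps = [
    {"trace_id": "t1", "step_type": "retrieval", "human": "sound", "judge": "sound"},
    {"trace_id": "t2", "step_type": "escalation", "human": "unsound", "judge": "sound"},
    {"trace_id": "t3", "step_type": "escalation", "human": "unsound", "judge": "sound"},
    {"trace_id": "t4", "step_type": "retrieval", "human": "sound", "judge": "unsound"},
]

print(disagreements_by_step_type(scored_steps))
# {'escalation': ['t2', 't3'], 'retrieval': ['t4']}
```

Deciding which judgment was correct in each case still takes a person; the grouping only tells you where to look first.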
Together, these four components produce a repeatable answer to the question "was this execution path sound?" Without them, success is whatever the logs say it was.
Calibrating the standard to the use case
A trajectory definition still requires a threshold. "Sound enough to run again" means different things depending on where the agent is deployed and what it controls. LangChain's 2026 data on production blockers shows how that threshold shifts by context:
Quality and accuracy are the top barrier at 32 percent of cases, cutting across all deployment types. For an internal research agent summarizing documents, a rubric that flags hallucinated citations is the primary quality gate.
Security is the second barrier for large enterprises at 25 percent. An agent with write access to production systems requires trajectory review that includes permission-check steps and boundary adherence, not just output quality.
Latency is the primary barrier for customer-facing agents at 20 percent. For these agents, a sound trajectory is also a fast one. A technically correct path that takes 45 seconds is not a viable customer experience.
Large organizations account for this by sequencing deployment. LangChain found that 26.8 percent of large enterprises prioritize internal productivity use cases before customer-facing roles. The logic is reliability testing. Internal agents operate in lower-stakes environments where rubrics can be refined and calibration can be tuned before the agent faces external users.
The evaluation framework is the same whether the agent is internal or external. The pass/fail threshold is not.
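In configuration terms, that can look like one set of checks with different pass criteria per deployment context. Every name and number below is hypothetical and chosen only to illustrate the shape; none of the thresholds come from the cited reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PassCriteria:
    min_rubric_pass_rate: float
    max_latency_seconds: Optional[float] = None  # None means latency is not gated
    require_permission_checks: bool = False


# Illustrative thresholds only.
CRITERIA = {
    "internal_research": PassCriteria(min_rubric_pass_rate=0.90),
    "production_write_access": PassCriteria(0.95, require_permission_checks=True),
    "customer_facing": PassCriteria(0.95, max_latency_seconds=10.0),
}


def passes(context: str, rubric_pass_rate: float, latency_seconds: float,
           permission_checks_present: bool) -> bool:
    """Apply the same three checks, with thresholds set by deployment context."""
    criteria = CRITERIA[context]
    if rubric_pass_rate < criteria.min_rubric_pass_rate:
        return False
    if (criteria.max_latency_seconds is not None
            and latency_seconds > criteria.max_latency_seconds):
        return False
    if criteria.require_permission_checks and not permission_checks_present:
        return False
    return True


# One trajectory: high rubric score, 45 seconds end to end, no permission checks.
for context in CRITERIA:
    print(context, passes(context, 0.96, 45.0, False))
# internal_research True
# production_write_access False
# customer_facing False
```

The same trajectory passes internally and fails in the two higher-stakes contexts, without any change to the evaluation code.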
By 2028, 45 percent of CIOs are expected to lead AI agent systems outside of IT, according to Gartner's AI Agent Layer report (2026). Gartner recommends a cross-functional council spanning the CIO, CFO, COO, CHRO, and General Counsel to govern those systems. The standard for what counts as a sound step is a business decision, not just a technical one.
From definition to repeatable process
A definition is only useful if it generates a feedback loop. The operational form: review traces where automated scores and human raters disagree. Use those disagreements to update rubrics, and track whether the changes reduce disagreement over time. Stable disagreement rates signal that the rubric has stopped detecting real variance. Declining disagreement rates signal that the rubric is improving the agent.
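Tracking that loop can be as simple as recording the disagreement rate after each review cycle and comparing the first cycle with the latest. This is a minimal sketch; the `tolerance` value and the sample rates are assumptions, and a team with enough data might prefer a proper trend test.

```python
def disagreement_rate(label_pairs: list[tuple[str, str]]) -> float:
    """Fraction of (human, judge) label pairs that differ in one review cycle."""
    if not label_pairs:
        raise ValueError("need at least one label pair")
    return sum(human != judge for human, judge in label_pairs) / len(label_pairs)


def trend(rates: list[float], tolerance: float = 0.02) -> str:
    """Classify the change between the first and the most recent cycle."""
    if len(rates) < 2:
        raise ValueError("need at least two review cycles")
    delta = rates[-1] - rates[0]
    if delta < -tolerance:
        return "declining"
    if delta > tolerance:
        return "rising"
    return "stable"


cycle = [("sound", "sound"), ("unsound", "sound"), ("sound", "sound"), ("sound", "sound")]
print(disagreement_rate(cycle))    # 0.25

print(trend([0.30, 0.22, 0.15]))   # declining
print(trend([0.20, 0.21, 0.20]))   # stable
```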
That 37-point gap between monitoring and formal evaluation isn't a tooling problem. It's a loop problem.
Scoutbee, a supply chain intelligence company, built evaluation cycles into their model development process using Label Studio. The results: a 20x reduction in the time to label, train, and maintain models. Model accuracy held above 90 percent across millions of documents. Revenue from ML-based products increased 2 to 3 times. Scoutbee's gains did not come from switching to better models. They came from closing the loop between evaluation signal and model improvement, consistently, at scale.
If the trajectory isn't defined, neither is success
A dashboard full of green checks is a record that the agent ran. Success is the standard applied to how it ran. Teams that write rubrics, anchor them to reviewed ground truth, and close the disagreement loop build agents worth trusting. Teams that don't are watching dashboards.
If you cannot describe what a sound step looks like for your agent, you do not yet have a definition of success. You have a hope that the output will look right.
How does trajectory evaluation compare to output accuracy scoring?
Output accuracy confirms only whether the final answer looks correct, which can mask broken reasoning or "phantom successes." Trajectory evaluation reviews the specific sequence of tool calls and intermediate steps an agent took. This approach is necessary for multi-step workflows where a correct answer might result from a flawed process that will fail under different conditions.
What is a reasonable success rate for production AI agents?
Success rates vary by task complexity, but research shows a half-life effect where reliability drops as steps increase. A twelve-step agent with 90 percent per-step accuracy succeeds roughly 28 percent of the time. Gartner projects that 40 percent of agent projects will fail or be canceled by 2027, citing agent-washing and skipped process mapping as the primary causes.
How do I detect if an agent is reporting a "helpful lie" in logs?
A helpful lie occurs when an agent claims success in system logs despite a backend failure, often to resolve the narrative of the task. Salesforce found that 1 in 7 interactions that appear successful in logs actually fail the user. Detecting these requires comparing log-level completions against human-reviewed ground truth sets to identify discrepancies.
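As a sketch of that comparison, the snippet below joins log statuses with human review labels by trace ID and returns the traces that logged a completion but were judged to have failed the user. The status and label strings, and the sample data, are hypothetical.

```python
def phantom_successes(log_status: dict[str, str],
                      human_labels: dict[str, str]) -> list[str]:
    """Trace IDs logged as completed that human review marked as failing the user.

    Traces without a human label are skipped, not assumed to be fine.
    """
    return sorted(
        trace_id
        for trace_id, status in log_status.items()
        if status == "completed" and human_labels.get(trace_id) == "failed_user"
    )


# Illustrative data only.
logs = {"t1": "completed", "t2": "completed", "t3": "error", "t4": "completed"}
reviews = {"t1": "helped_user", "t2": "failed_user", "t3": "failed_user"}

print(phantom_successes(logs, reviews))  # ['t2']
```

Trace `t4` has no human label, so it stays out of the result; unreviewed completions are unknowns, which is why the size of the reviewed set matters.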
Can a small team run formal evaluation without a dedicated function?
Yes, small teams can implement formal evaluation by using calibrated automated scoring. By validating an LLM-as-judge against a small set of human-reviewed trajectories, a team can scale review throughput without constant manual intervention. According to HumanSignal's research, this loop allows teams to update rubrics only when automated and human scores diverge.
What happens when automated scores and human raters persistently disagree?
Persistent disagreement signals either that the rubric is too vague or that the judge model lacks the context to assess a specific step type. This divergence is a primary signal for disagreement review, where teams examine the conflicting cases to refine the standard. Stable disagreement rates often indicate the evaluation framework has stopped detecting meaningful variance in agent performance.