Step-level vs. outcome-level evaluation: What's the difference?
A model returns the right answer. Every outcome metric passes. Yet research published in 2026 tested Claude Opus 4.6-R on 486 medical questions and found that removing any individual reasoning step changed the final answer less than 2 percent of the time. So what did you actually evaluate?
TL;DR
Outcome-level evaluation checks the final answer, not the path to it.
Step-level evaluation scores each intermediate step for factuality, validity, coherence, and utility.
Without process-specific training, a frontier model's reasoning steps can be purely decorative.
Training objective, not model size, determines whether a model's reasoning is genuine or ornamental.
For tool-using agents, step-level evaluation is necessary because tool-use errors can have irreversible side effects.
What outcome-level evaluation measures
Outcome reward models (ORMs) judge a single thing: the final answer. The model either produced the right output or it did not. For a long time, this was sufficient. Single-turn tasks with clear right-or-wrong answers, recoverable errors, and short reasoning chains did not require anything more.
The problem is that ORMs cannot see inside the reasoning chain. A correct final answer might have arrived through sound logic, a lucky cancellation of errors, or steps never causally connected to the output. The signal is too coarse to distinguish between those three situations.
A 2025 survey of process reward models (PRMs) notes that early pipelines relied on outcome reward models that judged only final answers. As tasks grew longer and more complex, that outcome-centric view could no longer capture stepwise progress, diagnose intermediate errors, or allocate computation adaptively.
The practical consequence is that outcome scores pass models they should not. A model can hallucinate every intermediate step and still land on a correct answer if the final mapping from question to answer is simple enough. Outcome evaluation misses the entire failure.
A 2025 reasoning taxonomy makes the problem precise: answer accuracy is insufficient for measuring reasoning ability. A correct answer does not confirm the preceding reasoning trace was correct. The trace is what teams actually deploy when they ship an agent, so evaluating only the answer means the trace has never been tested.
What step-level evaluation adds
Process Reward Models (PRMs) score each intermediate step, not just the final output. A model earns credit for each correct step. It loses credit for steps that are factually wrong, logically invalid, incoherent, or irrelevant to the task.
Step-level granularity matters most when errors cannot be undone. Tool-use research from 2026 draws a hard line between mathematical reasoning and agentic tool use. Math errors are often fixable by backtracking. Tool-use failures can cause irreversible side effects, which makes step-level verification critical.
A 2025 reasoning taxonomy identifies four dimensions that step-level evaluation can score where outcome metrics cannot:
Factuality: does the step assert something true? A hallucinated intermediate claim fails here even if the final answer is correct.
Validity: does the step follow logically from the steps before it? Correct facts combined through invalid inference still produce an invalid step.
Coherence: does the step fit semantically with the surrounding reasoning? Steps that are internally correct but disconnected from the task fail here.
Utility: does the step contribute to reaching the answer? Decorative steps that add no information fail this test even when factually correct.
Outcome scores are silent on all four dimensions. A model that scores perfectly on outcomes might fail every dimension at the step level.
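The gap between the two levels can be sketched in a few lines. This is a minimal illustration, not a schema from the taxonomy: the `StepScore` type, its boolean fields, and the example trace are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepScore:
    """Hypothetical per-step verdict on the four dimensions."""
    factual: bool
    valid: bool
    coherent: bool
    useful: bool

    def passed(self) -> bool:
        # A step passes only if it clears all four dimensions.
        return self.factual and self.valid and self.coherent and self.useful


def outcome_pass(predicted: str, expected: str) -> bool:
    """Outcome-level check: only the final answer is compared."""
    return predicted.strip() == expected.strip()


def failing_steps(scores: list[StepScore]) -> list[int]:
    """Step-level check: indices of steps that fail any dimension."""
    return [i for i, s in enumerate(scores) if not s.passed()]


# Made-up trace: the final answer is correct, but step 1 asserts a
# false claim and step 2 contributes nothing to the answer.
trace_scores = [
    StepScore(factual=True, valid=True, coherent=True, useful=True),
    StepScore(factual=False, valid=True, coherent=True, useful=True),
    StepScore(factual=True, valid=True, coherent=True, useful=False),
]

print(outcome_pass("42", "42"))     # True
print(failing_steps(trace_scores))  # [1, 2]
```

The outcome check passes while two of three steps fail, which is exactly the case an outcome score cannot surface.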
The three reasoning modes outcome scores cannot distinguish
Here is what outcome evaluation cannot see: frontier models do not all reason the same way, and the differences are not visible at the output level.
Research from 2026 identifies three distinct modes frontier models use when producing step-by-step reasoning.
Genuine reasoning
Steps are causally necessary. Removing any one of them changes the final answer. The model has not just written down a chain of thought; it has actually followed one. The reasoning trace is the process by which the answer was reached.
Scaffolding
The structure of step-by-step thinking improves accuracy, but individual step content is interchangeable. The model benefits from the extra computation that writing steps forces. Swap one valid-sounding step for another and the answer doesn't change.
Decoration
Steps are ornamental. They do not improve accuracy and the model has already determined its answer before it writes them down. The chain of thought is post-hoc rationalization, written after the answer is already set. Steps can be removed, shuffled, or replaced without affecting the output.
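One way to probe which mode a model is in is a leave-one-out ablation: remove each step in turn and check whether the final answer changes. The sketch below is an assumption about how such a test could be wired up, not the method used in the research. The `AnswerFn` signature and the two toy "models" are hypothetical stand-ins.

```python
from typing import Callable, Sequence

# Hypothetical signature: given a question and a list of reasoning steps,
# return the final answer the model produces when conditioned on them.
AnswerFn = Callable[[str, Sequence[str]], str]


def step_necessity(question: str, steps: Sequence[str], answer_fn: AnswerFn) -> float:
    """Fraction of steps whose removal changes the final answer."""
    if not steps:
        return 0.0
    baseline = answer_fn(question, steps)
    changed = 0
    for i in range(len(steps)):
        # Drop step i and keep every other step in order.
        ablated = list(steps[:i]) + list(steps[i + 1:])
        if answer_fn(question, ablated) != baseline:
            changed += 1
    return changed / len(steps)


# Toy stand-ins for a model, used only to exercise the function.
def genuine_model(question: str, steps: Sequence[str]) -> str:
    # The answer depends on every step being present.
    return str(sum(int(s) for s in steps))


def decorative_model(question: str, steps: Sequence[str]) -> str:
    # The answer ignores the steps entirely.
    return "6"


steps = ["1", "2", "3"]
print(step_necessity("sum?", steps, genuine_model))     # 1.0
print(step_necessity("sum?", steps, decorative_model))  # 0.0
```

Low step necessity alone does not separate scaffolding from decoration. Telling those two apart would also require comparing accuracy with and without the step-by-step structure.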
Why training objective matters
The critical finding is that training objective, not organization, model family, or capability tier, determines which mode a model uses. DeepSeek R1, trained with a process-focused objective, shows 91 to 93 percent step necessity on mathematics tasks. DeepSeek V3.2, built by the same organization on the same infrastructure but trained for outcomes, shows 4 percent step necessity on the same tasks. Same company. Same infrastructure. Training objective is the variable.
The mechanistic evidence confirms this is not cosmetic. Reasoning-trained models semantically process their steps, showing a 7 to 19 percentage point attention gap over shuffled text. Models without process training attend purely positionally. They are reading the structure, not the meaning.
The 2 percent figure from medical reasoning matters because it shows what decoration mode looks like at scale. Claude Opus 4.6-R is a frontier model by any standard. Without process-specific training, it operates in decoration mode on medical questions. Its reasoning traces look identical to those of a model doing genuine reasoning. Its outcome scores are indistinguishable. Only step-level evaluation reveals the difference.
Choosing the right evaluation level for your workflow
The decision comes down to two questions. Are errors recoverable? Does the reasoning chain need to be auditable or feed a training loop?
When outcome-level is sufficient
For single-turn tasks where errors are caught downstream and the path to the answer is irrelevant, outcome evaluation is defensible and cheaper. Retrieval tasks, classification tasks with human review, and low-stakes summarization all sit comfortably at the outcome level. The overhead of step-level annotation doesn't pay off when no step in the chain can cause irreversible harm.
When step-level is necessary
If your model calls external tools, writes to databases, executes code, or triggers real-world actions, outcome evaluation isn't enough. A wrong API call isn't recoverable by checking the final answer. A wrong write isn't undone by noticing the output looked strange. Step-level evaluation is the only mechanism that catches errors before they propagate through a multi-stage pipeline.
Step-level evaluation is also necessary when you're building a training loop. Outcome scores provide too little signal to identify which steps to correct and which to reinforce. PRMs operate in a closed loop: generate process data, train models to score individual steps, then use those scores for test-time scaling or reinforcement learning. Better training data follows. None of it works without step-level labels to start from.
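The test-time scaling half of that loop can be sketched as best-of-N selection over candidate reasoning chains. Everything here is illustrative: the `StepScorer` signature is hypothetical, the toy scorer is a placeholder for a trained PRM, and aggregating a chain by its weakest step is one possible choice rather than a prescribed one.

```python
from typing import Callable, Sequence

# Hypothetical step scorer: (question, prior steps, candidate step) -> score in [0, 1].
StepScorer = Callable[[str, Sequence[str], str], float]


def chain_score(question: str, steps: Sequence[str], scorer: StepScorer) -> float:
    """Score a chain by its weakest step (an assumed aggregation rule)."""
    if not steps:
        return 0.0
    return min(scorer(question, steps[:i], step) for i, step in enumerate(steps))


def best_of_n(
    question: str,
    candidates: Sequence[Sequence[str]],
    scorer: StepScorer,
) -> Sequence[str]:
    """Pick the candidate chain with the highest aggregated step score."""
    return max(candidates, key=lambda steps: chain_score(question, steps, scorer))


# Toy scorer standing in for a trained PRM: penalizes steps that admit guessing.
def toy_scorer(question: str, prior: Sequence[str], step: str) -> float:
    return 0.1 if "guess" in step else 0.9


candidates = [
    ["compute 2 + 2 = 4", "guess the rest"],
    ["compute 2 + 2 = 4", "double it: 8"],
]
best = best_of_n("What is (2 + 2) * 2?", candidates, toy_scorer)
print(best)  # ['compute 2 + 2 = 4', 'double it: 8']
```

Taking the minimum means a single bad step sinks the whole chain, which suits workflows where one wrong step is enough to cause harm.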
The annotation cost objection
Step-level annotation costs more per example than outcome labeling. For simple, high-volume tasks, the overhead outweighs the diagnostic value.
The cost gap narrows considerably with automated step-level judges. The LLM-w-Rationale framework achieves a Pearson correlation of 0.87 with human expert evaluation while requiring only 1.4 percent of the time of full manual review. That roughly 70x reduction makes step-level evaluation tractable for high-stakes workflows.
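The arithmetic behind the time figure is short enough to check directly. The only input is the 1.4 percent figure quoted above, and the variable names are illustrative.

```python
# Automated review takes 1.4 percent of full manual review time (figure from the text).
time_fraction = 0.014

speedup = 1 / time_fraction                # how many times faster
reduction_pct = (1 - time_fraction) * 100  # share of review time saved

print(round(speedup, 1))        # 71.4
print(round(reduction_pct, 1))  # 98.6
```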
In practice, the answer is triage: use automated step-level judges to flag turns that look risky, then route them to human reviewers. HumanSignal's agentic trace interface is built for this workflow. Teams can filter individual turns by type (User, Assistant, or Tool call) and assign pass/fail verdicts, severity scores, and issue tags to specific steps. Human reviewers see only the steps that need attention, not every turn in every trajectory.
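The triage step itself might look something like the sketch below. It is a generic illustration and does not use or represent HumanSignal's API. The `Turn` type, the `judge_score` field, and both thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Turn:
    """Hypothetical record for one turn in an agent trajectory."""
    index: int
    kind: Literal["user", "assistant", "tool_call"]
    judge_score: float  # assumed automated judge score in [0, 1]; higher is safer


def flag_for_review(
    turns: list[Turn],
    threshold: float = 0.5,
    tool_threshold: float = 0.8,
) -> list[Turn]:
    """Return turns a human should review.

    Tool calls get a stricter threshold because their side effects
    may not be reversible.
    """
    flagged = []
    for turn in turns:
        limit = tool_threshold if turn.kind == "tool_call" else threshold
        if turn.judge_score < limit:
            flagged.append(turn)
    return flagged


turns = [
    Turn(index=0, kind="user", judge_score=0.95),
    Turn(index=1, kind="assistant", judge_score=0.90),
    Turn(index=2, kind="tool_call", judge_score=0.70),
    Turn(index=3, kind="assistant", judge_score=0.40),
]
print([t.index for t in flag_for_review(turns)])  # [2, 3]
```

Only two of the four turns reach a reviewer. The tool call is flagged even though its score would pass the default threshold.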
Teams with limited annotation capacity should reserve step-level review for workflows where an error in one step propagates irreversibly through subsequent steps. For everything else, outcome evaluation is adequate.
The evaluation choice is a reasoning claim
When you run outcome-level evaluation, you're confirming the answer arrived. When you run step-level evaluation, you're confirming the route was sound. The opening question has a direct answer: a model with right answers but decorative reasoning has passed only a destination check.
The decision rule is direct. If your agent calls external tools, writes to external systems, or feeds outputs into a training loop, you can't defer step-level evaluation. Those errors don't reverse. A correct outcome from decorative reasoning looks identical to one from genuine reasoning. The difference surfaces only when the reasoning fails somewhere that matters. Step-level evaluation is how you find out which kind you have before that happens. For teams ready to scale that process across agent workflows, the tooling already exists.
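Written as code, the rule is a single predicate. The `Workflow` type and its field names are hypothetical and simply mirror the three conditions named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """Hypothetical description of an agent workflow."""
    calls_external_tools: bool
    writes_to_external_systems: bool
    feeds_training_loop: bool


def needs_step_level_eval(workflow: Workflow) -> bool:
    """True if any of the three conditions from the decision rule holds."""
    return (
        workflow.calls_external_tools
        or workflow.writes_to_external_systems
        or workflow.feeds_training_loop
    )


summarizer = Workflow(False, False, False)
booking_agent = Workflow(True, True, False)

print(needs_step_level_eval(summarizer))     # False
print(needs_step_level_eval(booking_agent))  # True
```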
What is the difference between step-level and trajectory evaluation?
Step-level evaluation scores individual intermediate actions, such as a single tool call or reasoning thought, while trajectory evaluation judges the entire sequence of steps as a whole. Trajectory metrics capture the overall efficiency and flow of the path, but only step-level evaluation provides the granular credit assignment needed to identify the exact "first-error" index where an agentic workflow began to fail.
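A small sketch of the distinction, assuming each step already carries a pass/fail verdict. The function names and example trajectories are illustrative.

```python
from typing import Optional, Sequence


def trajectory_passed(step_passed: Sequence[bool]) -> bool:
    """Trajectory-level verdict: one judgment for the whole sequence."""
    return all(step_passed)


def first_error_index(step_passed: Sequence[bool]) -> Optional[int]:
    """Step-level credit assignment: index of the first failing step, if any."""
    for i, ok in enumerate(step_passed):
        if not ok:
            return i
    return None


early_failure = [True, False, True, True]
late_failure = [True, True, True, False]

# Both trajectories get the same whole-sequence verdict...
print(trajectory_passed(early_failure), trajectory_passed(late_failure))  # False False
# ...but step-level labels show where each one went wrong.
print(first_error_index(early_failure), first_error_index(late_failure))  # 1 3
```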
How much more does step-level annotation cost than outcome labeling?
Step-level annotation is significantly more expensive because it requires reviewers to verify multiple intermediate points rather than a single final answer. However, the LLM-w-Rationale framework takes only 1.4 percent of the time of full manual review, a 98.6 percent reduction, while achieving a 0.87 Pearson correlation with human experts. Automated judges can then flag specific risky steps for human oversight.
Can an LLM judge reliably perform step-level evaluation?
Yes, but reliability depends on the judge's training and the specific criteria used, such as factuality, validity, coherence, and utility. While outcome-level judges often miss "lucky successes," specialized Process Reward Models (PRMs) can effectively score reasoning steps, allowing smaller models to outperform much larger ones by searching for the highest-scoring reasoning path at inference time.
Which tasks require step-level review instead of outcome scoring?
Step-level review is necessary for high-stakes, multi-step tasks where errors are irreversible, such as agents calling external APIs, writing to databases, or performing medical reasoning. Outcome scoring is defensible only for low-stakes, single-turn tasks like summarization or classification, where the final output is easily verifiable and the intermediate path carries no operational risk.
How do I use step-level evaluation data to build a training loop?
You can use step-level scores to build a Process Reward Model (PRM) loop by generating process data, training a model to score individual steps, and using those scores for reinforcement learning or test-time scaling. This fine-grained feedback allows you to reinforce specific correct reasoning behaviors rather than just rewarding the final answer, which is critical for steering models away from decorative reasoning.