How to handle non-determinism in agent evaluation
Setting temperature to zero is the most common answer teams give when asked how they handle non-determinism in agent evaluation. It's understandable: the name implies a kind of stillness, a stopping of the randomness. Relying on that setting alone creates a misleading sense of stability. Hardware-level variance persists even when the temperature is set to zero.
TL;DR
Setting temperature to zero does not guarantee identical results; accuracy can still vary by up to 15 percent across runs.
Multi-run testing often reveals lower success rates than single-run benchmarks suggest.
Use optimistic and pessimistic bound metrics to find your agent's true reliability range.
Human review should focus on traces where automated scores are ambiguous.
Agents with large gaps between best-run and worst-run scores may need more tuning before deployment.
Why temperature=0 doesn't eliminate non-determinism
The temperature setting controls token sampling, not the underlying compute. An LLM non-determinism study of five models across eight tasks found accuracy variations of up to 15 percent across runs configured at temperature 0. The gap between best and worst possible performance reached 70 percent on some tasks. None of the tested models delivered repeatable accuracy across all tasks, much less identical output strings.
Non-determinism is likely essential to efficient compute use, not a software configuration error. Co-mingled data in input buffers makes parallel GPU operations order-dependent. LLMs are non-deterministic by design, and a temperature parameter can't override that.
Practitioners have arrived at the same conclusion from a different angle. Floating-point operations on GPUs are massively parallel, and the order of those operations can vary between runs even with identical seeds. You get different outputs not because of randomness in the probabilistic sense, but because of hardware-level variance in how arithmetic resolves. Temperature=0 is not a fix. It's a different setting on a system that was never fully controllable.
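The arithmetic behind that claim can be seen without a GPU. The sketch below is plain Python with values chosen purely for illustration: it adds the same three numbers in two different groupings and gets two different answers, because floating-point addition is not associative.

```python
# Floating-point addition is not associative: the grouping changes the result.
big, small = 1e16, 1.0

# Adding small to big first loses it to rounding, then big cancels out.
left_to_right = (big + small) + (-big)

# Cancelling the big terms first lets small survive.
regrouped = (big + (-big)) + small

print(left_to_right, regrouped)      # the two sums differ
print(left_to_right == regrouped)    # False
```

A parallel reduction may group partial sums differently from one run to the next, which is the same effect at a much larger scale.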
The reliability gap your single-run metrics are hiding
What the benchmark numbers actually measure
Single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run you select, per agentic benchmark research. Standard deviations exceed 1.5 percentage points even at temperature 0. That range has a consequence: reported improvements of 2 to 3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. A model that seems to have improved between eval cycles may have just had a better run.
The reliability gap is a practical problem. Agents that achieve 60 percent success on a single run can measure closer to 25 percent when consistency is tracked across multiple runs. A single run provides a valid data point. It doesn't capture the full distribution of outcomes that multi-run evaluation would reveal.
Why trajectories compound the problem
Agents build their responses step by step, and each step conditions on the previous output. Trajectories therefore diverge early, often within the first few percent of tokens, and small initial differences cascade into different solution strategies by step three or four. A minor variance in how the agent phrases its first tool call can produce a different execution path three steps later, and that path may or may not reach the right outcome.
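One way to make divergence visible is to record each run as a list of steps and find the first index where two runs differ. This is a minimal sketch; the step strings and tool names are hypothetical, and a real trace would carry structured records rather than strings.

```python
from itertools import zip_longest

def first_divergence(run_a, run_b):
    """Return the index of the first step where two runs differ, or None if identical."""
    for index, (step_a, step_b) in enumerate(zip_longest(run_a, run_b)):
        if step_a != step_b:
            return index
    return None

# Two hypothetical runs of the same task with the same input.
run_a = ["search_docs(q='refund policy')", "read_doc(id=12)", "draft_reply()"]
run_b = ["search_docs(q='refund policy')", "read_doc(id=7)", "ask_user()", "draft_reply()"]

print(first_divergence(run_a, run_b))  # 1: the runs part ways at the second step
```

Logging this index across many run pairs shows how early a given agent tends to branch.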
Scaling evaluation for agent workflows requires a different approach than evaluating a single model response, because of these cascading differences. You're not measuring a point. You're sampling from a distribution of possible execution paths, and a single run tells you almost nothing about the shape of that distribution.
Building a multi-run evaluation baseline
Three practices that produce defensible numbers
Researchers studying non-determinism in agentic benchmarks recommend three practices for generating reliable measurements that address the variance pattern described above.
Estimate pass@1 from multiple independent runs. Run each evaluation task independently multiple times and average the results. Averaging across runs absorbs natural variance; a single selected run hides it.
Use statistical power analysis before treating a number as real. The required number of runs depends on the effect size you're trying to detect. If you want to distinguish a 2-point improvement from noise, the math tells you how many runs that requires. Skipping this step means your evaluation budget is guesswork.
Track pass@k and pass^k with k greater than 1. Pass@k gives you the optimistic bound: what the agent can do when it gets multiple attempts. Pass^k gives you the pessimistic bound: what it delivers consistently. The distance between them is your reliability envelope. A wide envelope signals a capable but unstable agent. A narrow one signals something you can actually deploy.
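The three numbers can be computed from the same set of recorded runs. The sketch below assumes the commonly used combinatorial estimators (drawing k runs without replacement from n recorded runs with c successes); the research summarized above does not prescribe these exact formulas here, and the task names and results are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k runs drawn from n recorded runs succeeded."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k runs drawn from n recorded runs succeeded (pass^k)."""
    return comb(c, k) / comb(n, k)

def summarize(results: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Average pass@1, pass@k and pass^k over tasks; results maps task -> run outcomes."""
    if k < 2:
        raise ValueError("use k greater than 1 to get separate bounds")
    totals = {"pass@1": 0.0, f"pass@{k}": 0.0, f"pass^{k}": 0.0}
    for runs in results.values():
        n, c = len(runs), sum(runs)
        if k > n:
            raise ValueError("each task needs at least k recorded runs")
        totals["pass@1"] += c / n
        totals[f"pass@{k}"] += pass_at_k(n, c, k)
        totals[f"pass^{k}"] += pass_hat_k(n, c, k)
    return {name: total / len(results) for name, total in totals.items()}

# Hypothetical outcomes: 10 independent runs for each of three tasks.
results = {
    "task_a": [True] * 10,
    "task_b": [True] * 5 + [False] * 5,
    "task_c": [False] * 10,
}
print(summarize(results, k=3))
```

In this made-up data, task_b succeeds half the time, so it contributes strongly to pass@3 and almost nothing to pass^3. That is the kind of gap the reliability envelope is meant to expose.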
Measuring output stability directly
Two additional metrics help quantify stability across runs. TARr@N measures total agreement rate over raw output across N runs; TARa@N measures total agreement rate of parsed answers across N runs. The LLM non-determinism study introduced both metrics. They show how consistent an agent's output is across identical inputs and can help isolate whether a single component is the source of variance. Comparing the two also indicates whether runs differ only in wording (low TARr, higher TARa) or in the answer itself (low TARa).
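A rough sketch of both metrics, assuming "total agreement" means that all N runs for an input produced the same value. The parser below is a hypothetical stand-in that takes the last word of each output; a real one depends on your answer format.

```python
def total_agreement_rate(outputs_per_task, parse=None):
    """Fraction of tasks whose N runs all agree, on raw text or on parsed answers."""
    if not outputs_per_task:
        raise ValueError("need at least one task")
    agreeing = 0
    for runs in outputs_per_task:
        values = [parse(text) for text in runs] if parse else list(runs)
        if len(set(values)) == 1:
            agreeing += 1
    return agreeing / len(outputs_per_task)

def last_word(text: str) -> str:
    """Hypothetical answer parser: the final word, without a trailing period."""
    return text.strip().split()[-1].rstrip(".")

# Three tasks, N = 3 runs each (made-up outputs).
outputs = [
    ["The answer is B.", "The answer is B.", "The answer is B."],
    ["Answer: C", "The answer is C.", "C"],
    ["The answer is A.", "The answer is D.", "The answer is A."],
]

tar_raw = total_agreement_rate(outputs)                      # TARr@3
tar_answer = total_agreement_rate(outputs, parse=last_word)  # TARa@3
print(tar_raw, tar_answer)
```

Here only the first task agrees on raw text, while the first two agree once answers are parsed, so the second task's variance is in wording only.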
The cost of statistical rigor
Running an agent 10 to 20 times per task for proper statistical coverage is expensive. For teams using frontier models at current API pricing, a 15-task eval suite run 10 times means 150 agent runs, roughly 10 times the cost of a single-run eval. Below a certain deployment scale, the cost of full statistical coverage can exceed the cost of occasional production failures.
The expense is a genuine constraint, but defensible compromises exist. Use a smaller model for grading, focus on tasks with the highest risk, or set k=3 with a pessimistic bound as the floor. Pass^3 won't give you the full distribution, but it will surface agents whose pessimistic performance is dramatically below their single-run score. Teams often discover this specific failure mode only after deployment.
Scaling human review without creating a bottleneck
What you're actually reviewing
The output of an agent evaluation isn't a response string. It's an agentic trace: the step-by-step log of reasoning, tool calls, memory retrievals, and environmental interactions that produced a final result.
Non-determinism causes problems across four layers of agent behavior, and reviewing only the final output misses most of them:
LLM behavior: instruction following, safety, and whether the model interpreted the task correctly
Memory: storage correctness and retrieval precision. Did the agent recall the right context?
Tool use: selection accuracy, parameter mapping, and sequencing. Did it call the right tool with the right inputs in the right order?
Environment: authorization checks and resource limits. Did it operate within the boundaries it was given?
Research on end-to-end agent assessment found that evaluating across these four pillars captures behavioral deviations that pass/fail metrics overlook. A tool call that uses deactivate_user() instead of delete_user() passes a success check while creating a data integrity issue. The four-pillar framework flags it; the final output metric doesn't.
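To illustrate the tool-use pillar only, here is a minimal sketch that compares the tool calls in a trace against an expected sequence. The `ToolCall` structure, the `lookup_user` tool, and the argument names are hypothetical; a real trace schema will differ, and the other three pillars need their own checks.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def tool_use_findings(trace, expected):
    """List mismatches between the tool calls in a trace and the expected sequence."""
    findings = []
    for step, (got, want) in enumerate(zip(trace, expected)):
        if got.name != want.name:
            findings.append(f"step {step}: called {got.name}, expected {want.name}")
        elif got.args != want.args:
            findings.append(f"step {step}: wrong arguments for {got.name}")
    if len(trace) != len(expected):
        findings.append(f"expected {len(expected)} tool calls, got {len(trace)}")
    return findings

expected = [ToolCall("lookup_user", {"email": "a@example.com"}),
            ToolCall("delete_user", {"user_id": 42})]
trace = [ToolCall("lookup_user", {"email": "a@example.com"}),
         ToolCall("deactivate_user", {"user_id": 42})]

print(tool_use_findings(trace, expected))
```

A final-output check that only asks whether the task reported success would pass this trace; the step-level comparison flags the wrong tool.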
Routing review efficiently
Human review becomes your bottleneck if you treat every trace the same. Use a score-band threshold: let automated scoring handle high-confidence results in both directions, and route only the ambiguous band to reviewers.
LLM-based judges introduce their own non-determinism into evaluation results. Human-in-the-loop evaluation for agentic AI is the layer that makes automated scoring trustworthy. Traces that score clearly pass or clearly fail don't need a human reviewer. Traces that land in the ambiguous band do.
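The routing rule itself is small. The sketch below is a generic illustration, not any platform's API; the default band of 0.4 to 0.7 on a 0-1 scale is the typical range given in the FAQ at the end of this article and should be tuned to your own scorer.

```python
def route_trace(score: float, low: float = 0.4, high: float = 0.7) -> str:
    """Route a trace by automated score: auto-fail, auto-pass, or human review."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be on a 0-1 scale")
    if score < low:
        return "auto_fail"
    if score > high:
        return "auto_pass"
    return "human_review"  # ambiguous band, boundaries included

for score in (0.15, 0.55, 0.92):
    print(score, route_trace(score))
```

Including the band edges in the human queue is a deliberate choice here: a score sitting exactly on a threshold is not a confident result.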
Teams use HumanSignal's Evaluations feature to manage this triage. The platform offers three workflow modes: Fully Automated, Hybrid, and Fully Manual. Each mode lets teams set score thresholds so traces route to human reviewers only when automated scoring falls in uncertain territory. Domain expert time stays focused on the traces where it changes the result.
A tiered evaluation workflow for production decisions
With the statistical baseline and human review layer in place, the workflow breaks into three tiers you can apply consistently.
Tier 1 (Automated multi-run scoring): Run every candidate at least three times per task and track both pass@3 (the optimistic bound) and pass^3 (the pessimistic bound). Flag any agent where pass^3 falls more than 10 points below the single-run pass@1 estimate. That difference indicates a reliability issue.
Tier 2 (LLM-as-judge triage): Apply automated scoring using a smaller model to every trace in the ambiguous band. This narrows the queue for human review without discarding the traces that are most significant.
Tier 3 (Structured human expert review): Route flagged traces to domain expert review using annotation interfaces with structured rubrics across the four evaluation pillars. Reviewers assess traces, not just outputs. Human-reviewed traces feed back into the evaluation rubric, which makes the scoring criteria more stable across future runs.
Structured annotation makes review faster at scale. Scoutbee moved to this workflow and cut labeling and model maintenance time by 20x without dropping below their quality SLA. Their ML-based products scaled faster as a result, contributing to a 2 to 3x revenue increase. The reason is routing: expert time goes to decisions that require expert judgment, not to traces automated scoring can handle.
The QA workflow for annotators applies here as well. Human reviewers evaluating non-deterministic agent traces need calibration and inter-annotator agreement checks, or the human layer introduces its own instability.
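One common inter-annotator agreement statistic is Cohen's kappa, which corrects raw agreement between two reviewers for the agreement expected by chance. It is used here as an example choice, not as the method this workflow prescribes, and the labels below are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers who labeled the same traces."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of traces")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Ten traces labeled by two hypothetical reviewers.
reviewer_1 = ["pass"] * 6 + ["fail"] * 4
reviewer_2 = ["pass"] * 5 + ["fail"] * 4 + ["pass"]

print(round(cohens_kappa(reviewer_1, reviewer_2), 3))  # 0.583
```

The two reviewers agree on 8 of 10 traces, but because both label about 60 percent of traces as passing, chance alone accounts for a good share of that agreement.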
Evaluation is how you make non-determinism manageable
Temperature was never the control point, because the variance enters at the hardware level. What you can control is how thoroughly you sample the distribution before shipping, and which traces human reviewers see: the ones where automated scoring lacks confidence.
The decision rule is concrete: if the pessimistic bound falls more than 10 points below the single-run estimate, the agent is not production-ready. A high optimistic score doesn't change that. The gap isn't measurement noise. It's the agent showing you what it will do on a bad run, and bad runs happen in production.
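Written as a check, the rule is a single comparison. The function name and the percentage-point inputs are illustrative; the 60 and 25 values echo the example earlier in this article.

```python
def production_ready(pass_at_1: float, pessimistic: float, max_gap_points: float = 10.0) -> bool:
    """Apply the gap rule; both scores are percentages on a 0-100 scale."""
    return (pass_at_1 - pessimistic) <= max_gap_points

print(production_ready(60.0, 25.0))  # False: a 35-point gap
print(production_ready(62.0, 55.0))  # True: a 7-point gap
```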
How many runs are required for statistically significant agent evaluation?
Reliable pass@1 estimates require at least 10 to 20 independent runs per task to account for hardware-level variance. Research on agentic benchmarks shows that standard deviations exceed 1.5 percentage points even at temperature 0. Without this volume, reported improvements of 2 to 3 points often reflect evaluation noise rather than genuine progress.
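For a rough planning number, a standard two-proportion sample-size formula shows how many trials it takes to tell two pass rates apart. This sketch makes a strong simplifying assumption: every task-run is treated as an independent trial with the same success probability, which ignores correlation within a task. The result is total trials per agent (tasks times runs per task), not runs per task, and the significance level and power defaults are conventional choices rather than values from the research above.

```python
from math import ceil
from statistics import NormalDist

def trials_per_agent(p_baseline: float, p_candidate: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate trials needed per agent to detect the difference (two-sided test)."""
    if p_baseline == p_candidate:
        raise ValueError("the two pass rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    effect = p_candidate - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Detecting a 2-point versus a 10-point improvement over a 60 percent baseline.
print(trials_per_agent(0.60, 0.62))
print(trials_per_agent(0.60, 0.70))
```

The 2-point case needs far more trials than the 10-point case, which is why small reported gains on modest eval suites are hard to distinguish from noise.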
What is the most cost-effective way to run multi-run evaluations?
To manage API costs, use a tiered sampling strategy where you run a smaller, faster model for initial pass@k checks and reserve frontier models for the final pessimistic bound (pass^k) calculation. Focus your budget on tasks with the highest variance, as agents with a 60 percent success rate on single runs can drop significantly when measured for consistency.
When should I escalate an agent trace from automated to human review?
Escalate traces to human review when automated scores land in an ambiguous middle band, typically between 0.4 and 0.7 on a 0-1 scale. In HumanSignal, you can configure these score-band thresholds to route uncertain results to experts. This ensures human time is spent only on traces where automated judgment is unreliable.
How do I build a golden dataset when agent tasks have no single correct answer?
Focus on bounding the acceptable reasoning path rather than just the final string. A golden dataset for agents should include the four pillars of assessment: correct tool selection, precise memory retrieval, environment authorization, and instruction following. Use these structured rubrics to define "correctness" across the entire agentic trace.
How does non-determinism compound in multi-agent systems?
Non-determinism creates a butterfly effect where a minor variance in the first agent’s output cascades into different execution paths for downstream agents. Because trajectories often diverge within the first few percent of tokens, small initial differences in phrasing or tool parameters can lead to entirely different solution strategies by the time the final agent receives the handoff.