What is the difference between task completion and task correctness?
The agent's dashboard showed green. "Task complete." The workflow log showed every step executed in sequence. Three days later, a human checked the database. The row the agent was supposed to update was still blank. The downstream financial report had been generated from stale data. Nobody had flagged it because the system reported success.
Research from Carnegie Mellon University and Salesforce found that multi-step AI agents produce incorrect results roughly 70 percent of the time, even while returning signals that look like success. Completion and correctness are not the same measurement. Treating them as equivalent is how production systems produce wrong outputs without anyone noticing.
TL;DR
Task completion is a binary finish signal; it doesn't measure outcome quality.
Task correctness asks whether the underlying system state is right.
Self-reported completion explains only about 6 percent of verified outcomes.
Correctness governs evaluation when errors propagate downstream or affect regulated outputs.
Structured review workflows (gated onboarding, disagreement review, calibration) are how correctness gets operationalized at scale.
Task completion: what the signal actually measures
Task completion has roots in usability research, not AI quality measurement. It's often called "the UX bottom line": a binary measure of whether a user reached the end of a workflow, not whether what they produced along the way was good. The agent either terminated the task or it didn't.
The usability-focused framing made sense when researchers needed a simple proxy for whether a product was usable. It becomes misleading when applied to AI agents operating on live system states. Reaching the end of a workflow and producing a correct result are two different things.
Current models complete tasks that take a human under four minutes with close to 100 percent success. On tasks that take a human more than four hours, that drops below 10 percent, according to METR's 2025 research. As tasks grow longer, the finish signal gets less reliable as a proxy for result quality. An agent can traverse 40 sequential steps and still return a wrong output.
Completion measures whether the agent stopped at the right place. It says nothing about what it left behind.
Task correctness: measuring the state, not the step
Functional correctness asks a different question: does the underlying system state match what the task was supposed to produce? An agent that updates a database field, drafts a contract clause, or routes a support ticket has changed something in the world. Correctness means verifying that the change matches the intended outcome.
The standard for evaluation is how well an agent accomplishes its intended tasks: whether the response is grounded in the right information and achieves the right effect (ACM Queue, 2025). For high-stakes tasks, that requires a verification strategy, especially when the AI's output arrives faster than a human reviewer can check it.
Correctness is not a single number. It breaks into three dimensions: task completion, data retrieval accuracy, and result completeness, according to AssetOpsBench (arXiv, 2026). An agent can score well on the first dimension while failing on the second or third. A single completion flag hides all of that.
Correctness is also stateful. It requires inspecting what changed in the system, not just what the agent reported.
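A minimal sketch of a state check, assuming the agent's target is a row in a SQL database. The `invoices` table, its columns, and the `verify_state` helper are hypothetical names chosen for illustration, and the table and column names are treated as trusted identifiers. The point is that the check reads the system directly and ignores what the agent reported.

```python
import sqlite3


def verify_state(conn, table, key_column, key, expected):
    """Return mismatched fields as {column: (expected, actual)}.

    An empty dict means the stored row matches the intended outcome.
    Table and column names are assumed to be trusted (not user input).
    """
    columns = ", ".join(expected)
    row = conn.execute(
        f"SELECT {columns} FROM {table} WHERE {key_column} = ?", (key,)
    ).fetchone()
    if row is None:
        # The row the agent was supposed to write does not exist at all.
        return {col: (want, None) for col, want in expected.items()}
    return {
        col: (want, got)
        for (col, want), got in zip(expected.items(), row)
        if want != got
    }


# Hypothetical scenario: the agent reports success, but the field is still blank.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, status TEXT, amount REAL)")
conn.execute("INSERT INTO invoices (id, status, amount) VALUES (1, NULL, 120.0)")

agent_report = {"status": "complete"}  # the finish signal, not used by the check
mismatches = verify_state(
    conn, "invoices", "id", 1, {"status": "paid", "amount": 120.0}
)
print(mismatches)
```

Here the agent's report says "complete" while `mismatches` shows that `status` was never written, which is the gap a completion flag cannot surface.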
Where completion and correctness diverge: the hallucination of success
Practitioners often call this gap the "hallucination of success." An agent returns a plausible completion log, the system records a green status, and the actual system state doesn't match the intended outcome.
Self-reported completion is not a proxy for verified success
Self-reported task completion rates explain only around 6 percent of verified completion rates, according to MeasuringU's analysis. That breaks a common assumption: that a finished task is probably a correct one. Humans who said they completed a task failed it at rates that bore almost no relationship to their self-assessment. AI agents produce the same pattern at scale. The completion signal isn't reliable evidence of what happened in the system.
Reasoning models fail at correctness while appearing to complete tasks
Apple Machine Learning Research found that even models designed for multi-step reasoning fail at exact computation and reason inconsistently across similar problems (Illusion of Thinking, 2025). The model produces a chain of plausible-looking steps, reaches a conclusion, and returns a result. The result is wrong even when the reasoning chain looks sound. Thinking through a problem doesn't guarantee getting it right.
Multi-step agents produce incorrect results approximately 70 percent of the time, according to research from Carnegie Mellon University and Salesforce. The completion signal and the correct-result rate move independently.
The gap widens sharply in enterprise workflows
Simple UI tasks achieve 67 to 85 percent success rates in production. Enterprise workflows in systems like SAP and Workday drop to 9 to 19 percent, according to UI-CUBE benchmarking research (arXiv, 2025). Researchers attribute this drop to how models struggle to manage memory, plan hierarchically, and coordinate state across steps. In interconnected enterprise workflows, an agent can execute each step and still return an inaccurate final result. The completion rate stays high while correctness falls. The divergence is visible only if you measure both.
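As a rough illustration of tracking both metrics side by side, the sketch below assumes each run is logged with the agent's own finish signal and the result of an independent check. The `Run` record, its field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    reported_complete: bool  # the agent's own finish signal
    verified_correct: bool   # outcome of an independent state check


def summarize(runs):
    """Compute completion, correctness, and silent-failure rates for a batch."""
    if not runs:
        raise ValueError("no runs to summarize")
    total = len(runs)
    completed = [r for r in runs if r.reported_complete]
    # Silent failures: runs that reported success but failed verification.
    silent = [r for r in completed if not r.verified_correct]
    return {
        "completion_rate": len(completed) / total,
        "correctness_rate": sum(r.verified_correct for r in runs) / total,
        "silent_failure_rate": len(silent) / len(completed) if completed else 0.0,
    }


# Made-up sample data for illustration only.
runs = [
    Run("a", True, True),
    Run("b", True, False),
    Run("c", True, False),
    Run("d", False, False),
]
summary = summarize(runs)
print(summary)
```

In this made-up batch the completion rate is 0.75 while the correctness rate is 0.25. A dashboard showing only the first number would look healthy.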
When to prioritize each metric
The choice depends on the cost of a mistake.
Completion rate governs when:
Failure is recoverable with low effort (a draft a human will review before it affects anything, a suggestion the user can accept or discard)
Speed and iteration volume matter more than guaranteed accuracy in any single run
The task operates in a sandboxed or reversible environment where wrong outputs don't propagate
Correctness governs when:
The agent's output changes a system state that downstream processes read and act on (database records, financial ledgers, compliance documents)
Errors are not immediately visible and accumulate before anyone checks
The domain carries regulatory or legal consequences for incorrect outputs
A wrong result costs more to fix than the task itself was worth to automate
There's a project-viability dimension here too. More than 40 percent of agentic AI projects will be cancelled by end of 2027, according to Gartner, citing rising costs, unclear business value, and insufficient risk controls. Projects that track completion volume without verifying correctness often look healthy in staging and surface problems later in production.
The alpha-value metric (GitTaskBench, AAAI 2026) puts a number on it. It integrates task success rates with token cost and developer salary to measure what agent performance delivers in practice. A completed task with a wrong output can have a negative alpha-value: it consumed compute and time while producing a result that requires human remediation.
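The arithmetic behind a negative result can be shown with a simplified per-run calculation. This is an illustration of the idea only, not the formula published with GitTaskBench; the `net_value` function, its parameters, and all numbers are assumptions.

```python
def net_value(correct, hours_saved, hourly_rate, token_cost, remediation_hours=0.0):
    """Net value of one agent run, in currency units.

    Illustrative only, not the GitTaskBench alpha-value formula.
    A correct run earns the developer time it saved; an incorrect run
    earns nothing and adds the cost of human remediation.
    """
    benefit = hours_saved * hourly_rate if correct else 0.0
    remediation = 0.0 if correct else remediation_hours * hourly_rate
    return benefit - token_cost - remediation


# Hypothetical numbers: same task, same token spend, different outcomes.
good = net_value(True, hours_saved=2.0, hourly_rate=50.0, token_cost=1.5)
bad = net_value(
    False, hours_saved=2.0, hourly_rate=50.0, token_cost=1.5, remediation_hours=3.0
)
print(good, bad)
```

Both runs would count as "complete" in a completion metric, yet under these assumed numbers one is worth 98.5 and the other costs 151.5.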
How correctness gets built into evaluation workflows
To move correctness from a manual spot-check to a scalable workflow, you need three intervention points in the evaluation loop.
Gated onboarding
Before any annotator or automated reviewer contributes to production data, a correctness gate sets the quality bar. Gated onboarding involves testing against a set of ground-truth tasks with a defined minimum score. Annotators who don't meet the threshold don't advance to production volume. The gate catches the gap between "followed the instructions" and "produced correct outputs" before it propagates into training data.
HumanSignal's quality review documentation makes the consequence clear: weak annotations used in training degrade model performance. Completion without a quality gate means volume without reliability.
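A gate of this kind could be sketched as follows, assuming ground-truth labels are stored per task ID and the pass threshold is a simple accuracy cutoff. The `passes_gate` function, the labels, and the 0.8 threshold are hypothetical, not taken from any specific product.

```python
def passes_gate(submitted, ground_truth, min_score=0.9):
    """Score an annotator on ground-truth tasks and apply a minimum threshold.

    Returns (passed, score). Tasks the annotator skipped count as wrong.
    """
    if not ground_truth:
        raise ValueError("ground-truth set is empty")
    hits = sum(
        submitted.get(task_id) == label for task_id, label in ground_truth.items()
    )
    score = hits / len(ground_truth)
    return score >= min_score, score


# Hypothetical qualification round with four ground-truth tasks.
ground_truth = {"t1": "buy", "t2": "sell", "t3": "buy", "t4": "hold"}
annotator = {"t1": "buy", "t2": "sell", "t3": "sell", "t4": "hold"}

passed, score = passes_gate(annotator, ground_truth, min_score=0.8)
print(passed, score)
```

The annotator finished every task, so a completion check would pass them. The gate does not, because one of four answers is wrong and the score falls below the assumed threshold.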
Disagreement review
When two reviewers assess the same task and disagree, the standard response is to treat one of them as wrong. A more useful interpretation is that the rubric is ambiguous. Disagreement signals exactly where the evaluation criteria need tightening, and resolving those cases through expert adjudication produces reusable improvements to the workflow.
HumanSignal's evaluation framework treats disagreements as evidence about the rubric, not just about the annotators. Scaling agent evaluation requires a disagreement process that turns ambiguous cases into rubric updates rather than discarding them.
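One way to turn disagreements into rubric evidence is to group them by the rubric criterion being judged. The sketch below assumes each review record names its criterion; the record layout, criterion names, and `find_disagreements` helper are hypothetical.

```python
from collections import Counter


def find_disagreements(reviews):
    """Split out disputed reviews and count them per rubric criterion.

    reviews: iterable of (task_id, criterion, verdict_a, verdict_b).
    """
    disputed = [r for r in reviews if r[2] != r[3]]
    by_criterion = Counter(criterion for _, criterion, _, _ in disputed)
    return disputed, by_criterion


# Made-up review records from two reviewers.
reviews = [
    ("t1", "field_updated", "pass", "pass"),
    ("t2", "field_updated", "pass", "fail"),
    ("t3", "correct_source", "fail", "pass"),
    ("t4", "correct_source", "pass", "fail"),
    ("t5", "format", "pass", "pass"),
]
disputed, by_criterion = find_disagreements(reviews)
print(by_criterion.most_common())
```

A criterion that collects most of the disputes, here the assumed `correct_source`, is a candidate for clearer wording before anyone concludes that a reviewer is at fault.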
Calibration
Automated judges (LLMs scoring outputs at scale) drift from the intended correctness standard as volume increases. Calibration means running the automated judge against a ground-truth set, comparing its scores to expert human scores, and identifying where it mis-scores. Those divergence points become the focus of rubric refinement and instruction updates.
With regular calibration, automated scores stay anchored to what human experts consider correct, so volume doesn't erode the standard.
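The calibration loop can be sketched as a comparison between judge scores and expert scores on the same ground-truth tasks. The 1-to-5 scale, the `tolerance` parameter, and the `calibration_report` function are assumptions made for this example.

```python
def calibration_report(judge_scores, expert_scores, tolerance=0):
    """Compare automated judge scores with expert scores on shared tasks.

    Returns (agreement_rate, divergent), where divergent lists
    (task_id, judge_score, expert_score) for every task whose scores
    differ by more than `tolerance`.
    """
    shared = sorted(judge_scores.keys() & expert_scores.keys())
    if not shared:
        raise ValueError("no tasks scored by both judge and expert")
    divergent = [
        (t, judge_scores[t], expert_scores[t])
        for t in shared
        if abs(judge_scores[t] - expert_scores[t]) > tolerance
    ]
    agreement = 1 - len(divergent) / len(shared)
    return agreement, divergent


# Made-up scores on an assumed 1-to-5 scale.
judge = {"t1": 5, "t2": 4, "t3": 5, "t4": 2}
expert = {"t1": 5, "t2": 2, "t3": 5, "t4": 2}

agreement, divergent = calibration_report(judge, expert)
print(agreement, divergent)
```

The `divergent` list is the useful output: each entry is a concrete case to examine when revising the rubric or the judge's instructions.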
Sense Street, a capital markets fintech company, applied this approach through Label Studio Enterprise. The team used inter-annotator agreement metrics to verify correctness on high-variance financial data: request-for-quote classifications, indications of interest, and repo trading records. The result was a 120 percent increase in annotations per labeler. Throughput and correctness moved together rather than trading off.
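The case study does not say which agreement metric was used. As one common example of an inter-annotator agreement measure, the sketch below computes Cohen's kappa for two annotators in plain Python. The label names are made up, and this is not a description of Sense Street's or Label Studio's implementation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for four messages.
annotator_a = ["rfq", "rfq", "ioi", "ioi"]
annotator_b = ["rfq", "ioi", "ioi", "ioi"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa)
```

Raw agreement in this example is 0.75, but half of that is expected by chance given how often each label is used, so kappa comes out at 0.5.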
The agent stopped; that's not the same as getting it right
The dashboard said "done." What it couldn't surface: the system measured the finish signal, not the system state. Those are different things, and conflating them is precisely how a blank database row generates three days of downstream reports before anyone notices.
Before deploying any agent into a workflow where outputs propagate, define what correctness means for that specific state change. Which field should hold which value? Which document should reflect which decision? Then build a review step that checks it. Completion metrics tell you the agent is running. Correctness metrics tell you whether it's worth running.
Is task completion rate still worth tracking if I measure correctness?
Completion rate remains a useful operational metric for identifying where agents stall or encounter infrastructure blockers. While correctness measures the quality of the outcome, completion identifies friction in the workflow, such as environment setup and dependency resolution, which account for over 50 percent of agent failures.
How do I measure correctness when the task is too complex for real-time human review?
For high-velocity or expert-level tasks, teams use a three-dimensional evaluation model covering task completion, data retrieval accuracy, and result completeness (AssetOpsBench, 2026). This involves using automated "judges" calibrated against a ground-truth set to identify state changes that fall outside of acceptable logical constraints.
What is the cost trade-off of adding a correctness verification step?
Adding verification increases token consumption and latency but prevents the negative economic value of silent failures. The alpha-value metric integrates success rates with compute costs and developer salaries to show that a completed task with a wrong output often costs more in human remediation than the automation saved.
Which industries face the highest risk from incorrect task completion?
Finance, healthcare, and legal sectors face the highest risks because agent outputs often propagate into regulated system states. In capital markets, for example, firms like Sense Street use inter-annotator agreement to verify the correctness of high-variance data like repo trading records where a "successful" but wrong update creates significant liability.
How can I tell if a model is hallucinating success?
A hallucination of success occurs when an agent returns a plausible completion log while the underlying system state doesn't match the intended outcome. This is common in enterprise workflows, where agents achieve only 9 to 19 percent success despite executing each step in sequence. Detecting it requires state-based validation rather than inspecting the agent's self-reported reasoning.