Reinforcement Learning from Verifiable Rewards
![](/_astro/a895a8986032ac0ab2155dacc3a4b7960010c90c-1440x756_2hxiJC.png)
Reinforcement Learning with Verifiable Rewards is among the leading training strategies for injecting learning signals into LLMs, successfully employed by models such as DeepSeek R1 and Tülu 3. In a nutshell, Verifiable Rewards are simple functions that provide a clear-cut, binary ground truth signal - typically a “1” (correct) or “0” (incorrect) - to indicate whether a model’s output meets a predefined correctness criterion.
What are the Benefits of Verifiable Rewards?
Unlike the traditional neural reward models used in RLHF, verifiable rewards offer several advantages:
Ground Truth Alignment
Their binary nature ensures a direct, objective, bias-free connection to ground truth, making them ideal for precision-critical tasks like mathematical problem-solving and code execution.
Ease of Design and Evaluation
Verifiable rewards provide a quick and easy way to design robust RL environments, allowing subject matter experts to establish clear correctness criteria without deep machine learning expertise. Their explicit nature facilitates automated evaluation, minimizing reliance on human judgment and ensuring efficient and scalable integration into reinforcement learning pipelines.
Robustness Against Reward Hacking
Because verifiable rewards rely on strict, rule-based evaluations rather than learned approximations, there is little room for the LLM to “hack” the system. With a binary indicator, the model receives no partial credit for outputs that only superficially meet the criteria.
In essence, verifiable rewards serve as a straightforward “yes/no” gate that informs the learning algorithm whether a particular output meets the necessary conditions, thereby streamlining the training process by providing unambiguous feedback.
Types of Verifiable Rewards
Here are a few simple examples of verifiable rewards:
Mathematical Correctness
For tasks involving numerical or symbolic computations, verifiable rewards check the accuracy of the mathematical solution.
Example: Using the GSM8K dataset, which is made up of grade school math word problems, an LLM generates a step-by-step solution to a word problem and places its final answer after four hash symbols “####”. A string-matching check then compares this answer with the ground-truth answer: a completely correct answer scores 1 point, a correctly formatted but wrong answer scores 0.1 points, and an incorrectly formatted response scores 0 points.
For more on the mathematical correctness example, check out this paper from DeepSeek or this documentation from the Verl Python package.
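As an illustration, here is a minimal sketch of such a tiered reward, not the exact DeepSeek or Verl implementation. It assumes the ground-truth answer is available as a plain number string and that the model is expected to place its final answer after “####”.

import re

def gsm8k_reward(response: str, ground_truth: str) -> float:
    """Tiered verifiable reward for GSM8K-style answers (illustrative sketch)."""
    # Look for a final numeric answer after the "####" delimiter.
    match = re.search(r"####\s*(-?[\d,.]+)", response)
    if match is None:
        return 0.0  # incorrect format: no parsable final answer
    predicted = match.group(1).replace(",", "").strip()
    expected = ground_truth.replace(",", "").strip()
    if predicted == expected:
        return 1.0  # completely correct
    return 0.1      # correct format, wrong answer

For example, gsm8k_reward("6 * 7 = 42. #### 42", "42") returns 1.0, while a response that omits the delimiter returns 0.0.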
Code Execution
In scenarios where LLMs are used to generate code, verifiable rewards are derived from executing the code and comparing the outcome to an expected result, for example whether unit tests pass or exceptions are raised.
Example: In a multi-turn code synthesis setup, verifiable rewards can be assigned based on the execution results of the generated code against test cases. The reward function evaluates correctness at the end of each episode as follows:
- +1, if the episode ends and all tests pass
- -1, if the episode ends and any test fails
- -0.2, if no valid code was generated
For more on the code execution example, see this 2024 paper.
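A self-contained sketch of this scheme is shown below. It is not the setup from the cited paper: it assumes the test cases are provided as a string of assert statements that is appended to the generated code and run in a subprocess.

import subprocess
import sys

def code_execution_reward(code: str | None, tests: str) -> float:
    """End-of-episode reward for generated code (illustrative sketch)."""
    if not code or not code.strip():
        return -0.2  # no valid code generated
    try:
        # Run the generated code plus the assert-based tests in a fresh interpreter.
        result = subprocess.run(
            [sys.executable, "-c", code + "\n" + tests],
            capture_output=True,
            timeout=10,  # guard against infinite loops
        )
    except subprocess.TimeoutExpired:
        return -1.0  # treated like a failing test
    return 1.0 if result.returncode == 0 else -1.0  # all tests pass vs. any failure

In a production pipeline, execution would happen in a proper sandbox rather than a bare subprocess.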
Instruction-Following and Formatting Rewards
These rewards evaluate whether the LLM output strictly adheres to a given set of instructions or guidelines. Typically, this can be done via simple string matching.
Example: The model response should satisfy one of the following criteria:
- Include the keywords {keyword1} and {keyword2} in the response (see the keyword sketch after the code below).
- Write the entire response in {language}; no other language is allowed.
- Refrain from using any {punctuation}.
- Put the reasoning process between <think> and </think> tags.
import re

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    # Require the reasoning between <think> tags and the answer between <answer> tags.
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>\n$"
    # Each completion is a list of chat messages; score the content of the first message.
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [1.0 if match else 0.0 for match in matches]
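The other criteria in the list above (keywords, language, punctuation) can be verified with equally simple rules. Here is a hedged sketch of a keyword-inclusion check with the same signature; the default keywords are placeholders.

def keyword_reward_func(completions, keywords=("keyword1", "keyword2"), **kwargs) -> list[float]:
    """Reward function that checks whether all required keywords appear (sketch)."""
    responses = [completion[0]["content"] for completion in completions]
    return [
        1.0 if all(kw.lower() in r.lower() for kw in keywords) else 0.0
        for r in responses
    ]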
For more on the instruction following and formatting example, see this paper on instruction following in LLMs and this paper from DeepSeek.
Additional Categories
Verifiable rewards can also extend to other domains, such as:
- Factual Accuracy: Checking whether the output contains verifiable facts.
- Logical Consistency: Ensuring that arguments or narratives are internally consistent, for example by verifying solutions to propositional logic reasoning problems.
- Compliance with Regulations: In sensitive applications, outputs can be automatically screened for adherence to regulatory standards, such as checking for the presence of sensitive data or named entities (see the sketch after this list).
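As a toy illustration of the compliance case, the sketch below screens an output with a few regular expressions; the patterns are assumptions, and a real deployment would rely on dedicated PII or named-entity tooling.

import re

# Illustrative patterns only; real compliance checks need dedicated PII / NER tooling.
SENSITIVE_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",                            # US SSN-like pattern
    r"\b(?:\d[ -]?){13,16}\b",                           # possible credit card number
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",   # email address
]

def compliance_reward(response: str) -> float:
    """Return 1.0 if no sensitive pattern is found, else 0.0 (sketch)."""
    for pattern in SENSITIVE_PATTERNS:
        if re.search(pattern, response):
            return 0.0
    return 1.0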
Moreover, scaling up high-diversity RL environments with carefully designed verifiable criteria can elicit high-quality LLM reasoning on specific problems and domains, and the resulting reasoning traces can be collected and distilled into smaller, more efficient specialized models.
How to Design a Verifiable Reward Function
Designing a verifiable reward function requires domain expertise and structured data interfaces. This ensures that reinforcement learning (RL) systems optimize for measurable, high-quality outputs while avoiding biases and misalignment. The following steps outline a robust approach:
1. Collect and Curate Ground Truth Data
- Leverage Expert Knowledge: Collaborate with domain experts to define what constitutes a correct or high-quality response.
- Use High-Quality Datasets: Collect reliable, well-labeled datasets from publicly available sources, synthetic generation, or human annotation.
- Decontaminate Data: Ensure the dataset does not overlap with evaluation benchmarks to prevent overfitting (see the sketch after this list).
- Diverse Sampling: Gather data that spans a wide range of scenarios, ensuring robustness in generalization.
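For the decontamination step, one simple approach is to drop any training prompt that shares a long word n-gram with the evaluation benchmarks. The 8-gram threshold below is an assumption, not a standard, and the function names are illustrative.

def word_ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercased word n-grams of a text (sketch helper)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_prompts: list[str], eval_prompts: list[str], n: int = 8) -> list[str]:
    """Drop training prompts that share any word n-gram with the eval set (sketch)."""
    eval_ngrams = set()
    for prompt in eval_prompts:
        eval_ngrams |= word_ngrams(prompt, n)
    return [p for p in train_prompts if word_ngrams(p, n).isdisjoint(eval_ngrams)]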
2. Determine a Rule-Based Reward Function Based on Ground Truth
- Define Verification Criteria: Identify tasks with verifiable outcomes, such as mathematical proofs, code execution, instruction adherence or domain-specific targets.
- Exact Match and Heuristics: Use deterministic rules to check correctness (e.g., exact match on math answers, passing test cases in code, or matching predefined categories or a taxonomy).
- Avoid Model-Based Rewarding: Prioritize rule-based over model-driven reward functions to reduce reward hacking and ensure reliability.
- Multi-Level Scoring: Implement tiered scoring mechanisms to reward partial correctness where applicable (a sketch follows this list).
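As an example of combining exact-match rules with multi-level scoring, the sketch below checks format first and then correctness; the 0.2 partial-credit value and the <answer> tag convention are assumptions.

import re

def tiered_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward with tiers: format first, then exact-match correctness (sketch)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # wrong format
    if match.group(1).strip() == ground_truth.strip():
        return 1.0  # correct format and correct answer
    return 0.2      # correct format, wrong answer (partial credit)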
3. Validate the Reward Model Based on Generated Examples
- Run Controlled Tests: Generate model outputs and measure how well the reward function distinguishes correct from incorrect responses (a sketch follows this list).
- Evaluate for Robustness: Ensure the function avoids penalizing correct responses due to formatting issues or minor variations.
- A/B Testing with RL Agents: Compare performance between models trained with and without the verifiable reward function.
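A hedged sketch of the controlled-test step: given model outputs labeled as correct or incorrect by experts, measure how often the reward function agrees. The function and field names are illustrative, not a library API.

def validate_reward_fn(reward_fn, labeled_examples: list[tuple[str, str, bool]]) -> dict:
    """Measure how well a reward function separates correct from incorrect outputs (sketch).

    labeled_examples holds (response, ground_truth, is_correct) triples,
    where is_correct is an expert label.
    """
    tp = fp = fn = tn = 0
    for response, ground_truth, is_correct in labeled_examples:
        rewarded = reward_fn(response, ground_truth) > 0.5  # treat the reward as a binary decision
        if rewarded and is_correct:
            tp += 1
        elif rewarded and not is_correct:
            fp += 1  # reward-hacking risk: an incorrect output gets rewarded
        elif not rewarded and is_correct:
            fn += 1  # false penalty, e.g. a correct answer in an unexpected format
        else:
            tn += 1
    total = max(tp + fp + fn + tn, 1)
    return {
        "accuracy": (tp + tn) / total,
        "false_reward_rate": fp / max(fp + tn, 1),
        "false_penalty_rate": fn / max(fn + tp, 1),
    }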
4. Integrate it in the RL Process
- Use Reinforcement Learning with Verifiable Rewards (RLVR): Train policies that receive a reward only when responses meet the verification criteria (see the sketch after this list).
- Optimize with Proximal Policy Optimization (PPO): Balance reward maximization with controlled model divergence to prevent over-optimization.
- Monitor Reward Outputs: Regularly audit model outputs and reward scores to ensure the model does not exploit loopholes in the reward function.
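As one possible integration path (an assumption, since any RLVR-capable trainer works): the formatting reward defined earlier already matches the reward-function signature accepted by TRL's GRPOTrainer, so wiring it in could look roughly like the sketch below. The model name and dataset are placeholders, and the dataset must provide a "prompt" column.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset: any dataset with a "prompt" column works here.
dataset = load_dataset("your-org/your-prompt-dataset", split="train")

training_args = GRPOConfig(output_dir="rlvr-checkpoints")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",        # placeholder policy model
    reward_funcs=[strict_format_reward_func],  # verifiable reward(s) defined above
    args=training_args,
    train_dataset=dataset,
)
trainer.train()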
By following these steps, RL-based systems can achieve high performance while maintaining transparency and reliability in decision-making.