llm-ttrl-reward
Description
Test-Time RL: Reward Estimation Strategy
Objective
Design and implement an improved ensemble-based reward estimator for Test-Time Reinforcement Learning (TTRL). Your code goes in the `compute_ttrl_rewards()` function in `custom_ttrl_reward.py`.
Background
TTRL trains LLMs without ground truth labels at test time. For each prompt, it samples N responses, estimates a pseudo ground truth from the ensemble, then trains with RL using rewards derived from matching (or not matching) this estimate. The reward estimation strategy is the core algorithmic innovation — it determines label quality and training signal strength.
The standard approach (Majority Voting) extracts answers from all N responses, counts occurrences, and picks the most common answer. Responses matching this pseudo GT get reward 1, others get 0. However, this approach has limitations:
- String-based counting may miss semantically equivalent answers
- Binary rewards provide no gradient signal for "partially correct" responses
- Equal weighting ignores response quality or confidence signals
- No mechanism to handle low-confidence cases where the majority may be wrong
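For reference, the majority-voting baseline described above can be sketched in a few lines. This standalone version assumes answers have already been extracted from each response (the real pipeline would call `extract_answer` first), and exact string counting stands in for the semantic `grade` check:

```python
from collections import Counter
from typing import List, Optional, Tuple

def majority_vote(answers: List[str]) -> Tuple[str, float, Optional[List[float]]]:
    """Baseline: pick the most common extracted answer as the pseudo GT.

    Returns (estimated_gt, confidence, per_sample_rewards), where confidence
    is the majority vote fraction and rewards are binary match/no-match.
    """
    counts = Counter(answers)
    estimated_gt, n_votes = counts.most_common(1)[0]
    confidence = n_votes / len(answers)
    rewards = [1.0 if a == estimated_gt else 0.0 for a in answers]
    return estimated_gt, confidence, rewards
```

Note how this baseline exhibits the limitations listed above: semantically equivalent strings (e.g. `"1/2"` vs `"0.5"`) land in different buckets, and every non-majority response gets a flat 0 reward regardless of how close it was.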
Evaluation
Your reward estimator is evaluated on 3 benchmarks of increasing difficulty, all training Qwen2.5-1.5B with TTRL (32 votes, 16 training samples, 4 GPUs):
| Benchmark | Description | Dataset | Episodes | Response Length |
|---|---|---|---|---|
| AMC | AMC competition math (83 problems) | TTRL built-in (AMC-TTT) | 12 | 2048 tokens |
| MATH | Competition math (500 problems) | TTRL built-in (MATH-TTT) | 5 | 2048 tokens |
| AIME | AIME competition (30 problems) | TTRL built-in (AIME-TTT) | 25 | 3072 tokens |
Key metrics (from TTRL training logs):
- `label_accuracy`: How often the estimated pseudo GT matches the true ground truth
- `reward_accuracy`: How often the estimated reward matches what the true GT would give
- `ground_truth_reward`: Accuracy when evaluated against actual ground truth (primary metric)
- `pass@k`: Whether any of the k training samples is correct

Higher `ground_truth_reward` means the model is actually learning to solve more problems correctly, regardless of whether the pseudo labels are perfect.
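To make the metric definitions concrete, here is a small offline sketch of how label accuracy and reward accuracy could be computed when the true ground truth is known. Exact string equality stands in for the framework's semantic grading, and the per-prompt tuple layout is an assumption of this example, not the TTRL log format:

```python
from typing import List, Tuple

def label_and_reward_accuracy(
    prompts: List[Tuple[str, str, List[str]]],  # (pseudo_gt, true_gt, extracted_answers)
) -> Tuple[float, float]:
    """label_accuracy: fraction of prompts where the pseudo GT equals the true GT.
    reward_accuracy: fraction of responses whose pseudo-label reward agrees with
    the reward the true GT would have assigned."""
    label_hits = 0
    reward_hits, total_responses = 0, 0
    for pseudo_gt, true_gt, answers in prompts:
        label_hits += int(pseudo_gt == true_gt)
        for a in answers:
            pseudo_reward = a == pseudo_gt   # what the estimator pays out
            true_reward = a == true_gt       # what the oracle would pay out
            reward_hits += int(pseudo_reward == true_reward)
            total_responses += 1
    return label_hits / len(prompts), reward_hits / total_responses
```

One subtlety this makes visible: a wrong pseudo label can still produce mostly-correct rewards (and vice versa), which is why `label_accuracy` and `reward_accuracy` are tracked separately.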
Interface Contract
The training loop calls your function once per prompt:
```python
def compute_ttrl_rewards(
    prompt_responses: List[str],  # N response strings
    prompt_text: str,             # The original question
) -> tuple[str, float, Optional[list[float]]]:
    # Returns: (estimated_gt, confidence, per_sample_rewards)
    ...
```
Return values:
- `estimated_gt`: The estimated ground truth answer string (used for pseudo labeling)
- `confidence`: A float in [0, 1] indicating confidence in the estimate
- `per_sample_rewards`: If None, the framework uses binary reward (1 if the response matches `estimated_gt`, 0 otherwise). If provided as a list of N floats, these are used directly as reward signals for training.
Available utilities (already imported):
- `extract_answer(text)` — Extract the `\boxed{}` answer from a response string
- `simplify_expression_string(s)` — Simplify a math expression string for comparison
- `grade(pred, gt)` — Check if two math answers are semantically equivalent
- `TTRLClusterCounter(equiv_fn)` — Cluster items by an equivalence function (e.g., `grade`)
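Putting these pieces together, one possible estimator clusters answers by an equivalence check and returns soft rewards. This is a self-contained sketch, not TTRL's implementation: a greedy `cluster_answers` helper stands in for `TTRLClusterCounter`, lowercase/strip normalization stands in for `extract_answer` plus `grade`, and the cluster-share soft-reward scheme is one illustrative design choice:

```python
from typing import Callable, List, Optional, Tuple

def cluster_answers(answers: List[str], equiv: Callable[[str, str], bool]) -> List[List[int]]:
    """Greedily group answer indices by an equivalence function."""
    clusters: List[List[int]] = []   # each cluster holds indices into `answers`
    for i, a in enumerate(answers):
        for cluster in clusters:
            if equiv(a, answers[cluster[0]]):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def compute_ttrl_rewards(
    prompt_responses: List[str],
    prompt_text: str,
) -> Tuple[str, float, Optional[List[float]]]:
    # Stand-in normalization; the real module would use extract_answer + grade.
    answers = [r.strip().lower() for r in prompt_responses]
    clusters = cluster_answers(answers, lambda x, y: x == y)
    clusters.sort(key=len, reverse=True)
    majority = clusters[0]
    estimated_gt = answers[majority[0]]
    confidence = len(majority) / len(answers)
    # Soft rewards: majority cluster gets 1.0; minority clusters are credited
    # by a fraction of their vote share, so near-ties still get partial signal.
    rewards = [0.0] * len(answers)
    for cluster in clusters:
        share = len(cluster) / len(answers)
        for i in cluster:
            rewards[i] = 1.0 if cluster is majority else 0.5 * share
    return estimated_gt, confidence, rewards
```

The `confidence` output also gives a natural hook for the low-confidence failure mode noted earlier: a caller could down-weight or skip prompts whose majority fraction falls below some threshold.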
Code
1"""Custom TTRL reward estimation module.23The agent modifies `compute_ttrl_rewards()` to implement novel4reward estimation strategies for test-time RL without ground truth.5"""6from typing import List, Optional7from collections import Counter89# Utilities from TTRL (NOT editable, just imported)10from verl.utils.reward_score.ttrl_math import extract_answer, simplify_expression_string, grade11from verl.utils.reward_score.ttrl_math.cluster import TTRLClusterCounter121314# ── EDITABLE REGION START ──15def compute_ttrl_rewards(
Additional context files (read-only):
- `ttrl/verl/verl/trainer/ppo/ttrl_utils.py`
- `ttrl/verl/verl/utils/reward_score/ttrl_math/__init__.py`
- `ttrl/verl/verl/utils/reward_score/ttrl_math/cluster.py`
Results
No results available yet.