llm-ttrl-reward

Tags: Language Models, ttrl, rigorous codebase

Description

Test-Time RL: Reward Estimation Strategy

Objective

Design and implement a better ensemble-based reward estimator for Test-Time Reinforcement Learning (TTRL). Your code goes in the compute_ttrl_rewards() function in custom_ttrl_reward.py.

Background

TTRL trains LLMs without ground truth labels at test time. For each prompt, it samples N responses, estimates a pseudo ground truth from the ensemble, then trains with RL using rewards derived from matching (or not matching) this estimate. The reward estimation strategy is the core algorithmic innovation — it determines label quality and training signal strength.

The standard approach (Majority Voting) extracts answers from all N responses, counts occurrences, and picks the most common answer. Responses matching this pseudo GT get reward 1, others get 0. However, this approach has limitations:

  • String-based counting may miss semantically equivalent answers
  • Binary rewards provide no gradient signal for "partially correct" responses
  • Equal weighting ignores response quality or confidence signals
  • No mechanism to handle low-confidence cases where the majority may be wrong
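The baseline described above can be sketched in a few lines. This is a self-contained illustration, not the TTRL implementation: `extract_boxed` is a simplified stand-in for the framework's `extract_answer` utility, and plain string counting stands in for semantic grading.

```python
import re
from collections import Counter

def extract_boxed(text: str) -> str:
    """Stand-in for TTRL's extract_answer: pull the last \\boxed{...} span."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else ""

def majority_vote(responses):
    """Baseline majority voting: the most common extracted answer becomes
    the pseudo ground truth.

    Returns (estimated_gt, confidence, None); the None signals the framework
    to fall back to binary match-based rewards.
    """
    answers = [extract_boxed(r) for r in responses]
    counts = Counter(a for a in answers if a)
    if not counts:
        return "", 0.0, None
    estimated_gt, votes = counts.most_common(1)[0]
    confidence = votes / len(responses)
    return estimated_gt, confidence, None
```

Note that string-level counting is exactly where the first limitation bites: "1/2" and "0.5" land in different buckets even though `grade` would treat them as equivalent.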

Evaluation

Your reward estimator is evaluated on 3 benchmarks of increasing difficulty, all training Qwen2.5-1.5B with TTRL (32 votes, 16 training samples, 4 GPUs):

Benchmark   Difficulty                           Dataset                    Episodes   Response Length
AMC         AMC competition math (83 problems)   TTRL built-in (AMC-TTT)    12         2048 tokens
MATH        Competition math (500 problems)      TTRL built-in (MATH-TTT)   5          2048 tokens
AIME        AIME competition (30 problems)       TTRL built-in (AIME-TTT)   25         3072 tokens

Key metrics (from TTRL training logs):

  • label_accuracy: How often the estimated pseudo GT matches the true ground truth
  • reward_accuracy: How often the estimated reward matches what the true GT would give
  • ground_truth_reward: Accuracy when evaluated against actual ground truth (primary metric)
  • pass@k: Whether any of the k training samples is correct

Higher ground_truth_reward means the model is actually learning to solve more problems correctly, regardless of whether the pseudo labels are perfect.

Interface Contract

The training loop calls your function once per prompt:

def compute_ttrl_rewards(
    prompt_responses: List[str],   # N response strings
    prompt_text: str,              # The original question
) -> tuple[str, float, Optional[list[float]]]:
    # Returns: (estimated_gt, confidence, per_sample_rewards)

Return values:

  • estimated_gt: The estimated ground truth answer string (used for pseudo labeling)
  • confidence: A float in [0, 1] indicating confidence in the estimate
  • per_sample_rewards: If None, the framework falls back to binary rewards (1 if a response matches estimated_gt, 0 otherwise). If provided as a list of N floats, these values are used directly as per-response reward signals for training, which is how you can supply soft or graded rewards.
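One way to use the per_sample_rewards slot, sketched under the simplifying assumption that answers have already been extracted and that exact string equality stands in for semantic grading: reward each response by the vote share of its own answer, so minority-but-plausible answers receive a nonzero signal instead of a hard 0.

```python
from collections import Counter

def soft_cluster_rewards(answers):
    """Sketch of a soft reward: each sample is rewarded by the fraction of
    votes its answer received, rather than 0/1 against the single majority.

    `answers` is the list of already-extracted answer strings, one per sample.
    """
    counts = Counter(answers)
    n = len(answers)
    estimated_gt, top = counts.most_common(1)[0]
    confidence = top / n
    rewards = [counts[a] / n for a in answers]
    return estimated_gt, confidence, rewards
```

Whether this graded signal helps depends on how often near-majority clusters contain the true answer; that is precisely what label_accuracy and reward_accuracy let you diagnose.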

Available utilities (already imported):

  • extract_answer(text) — Extract \boxed{} answer from a response string
  • simplify_expression_string(s) — Simplify a math expression string for comparison
  • grade(pred, gt) — Check if two math answers are semantically equivalent
  • TTRLClusterCounter(equiv_fn) — Cluster items by an equivalence function (e.g., grade)
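The exact interface of TTRLClusterCounter is not shown here, so the class below is an illustrative reimplementation of the idea (equivalence-based clustering), not the real TTRL class; a whitespace-insensitive comparison stands in for `grade`.

```python
class EquivClusterCounter:
    """Illustrative: group items under a pairwise equivalence function,
    in the spirit of TTRLClusterCounter(grade). Not the real TTRL class."""

    def __init__(self, equiv_fn):
        self.equiv_fn = equiv_fn
        self.clusters = []  # list of (representative, members)

    def add(self, item):
        # Attach to the first cluster whose representative is equivalent.
        for rep, members in self.clusters:
            if self.equiv_fn(item, rep):
                members.append(item)
                return
        self.clusters.append((item, [item]))

    def most_common(self):
        """Return (representative, count) for the largest cluster."""
        rep, members = max(self.clusters, key=lambda c: len(c[1]))
        return rep, len(members)
```

Clustering with a semantic equivalence function directly addresses the first limitation of majority voting: answers like "1/2" and "1 / 2" collapse into one cluster instead of splitting the vote.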

Code

custom_ttrl_reward.py
"""Custom TTRL reward estimation module.

The agent modifies `compute_ttrl_rewards()` to implement novel
reward estimation strategies for test-time RL without ground truth.
"""
from typing import List, Optional
from collections import Counter

# Utilities from TTRL (NOT editable, just imported)
from verl.utils.reward_score.ttrl_math import extract_answer, simplify_expression_string, grade
from verl.utils.reward_score.ttrl_math.cluster import TTRLClusterCounter


# ── EDITABLE REGION START ──
def compute_ttrl_rewards(

Additional context files (read-only):

  • ttrl/verl/verl/trainer/ppo/ttrl_utils.py
  • ttrl/verl/verl/utils/reward_score/ttrl_math/__init__.py
  • ttrl/verl/verl/utils/reward_score/ttrl_math/cluster.py

Results

No results available yet.