llm-ttrl-reward

Tags: Language Models, ttrl, rigorous codebase

Description

Test-Time RL: Reward Estimation Strategy

Objective

Design and implement a better ensemble-based reward estimator for Test-Time Reinforcement Learning (TTRL). Your code goes in the compute_ttrl_rewards() function in custom_ttrl_reward.py.

Background

TTRL trains LLMs without ground truth labels at test time. For each prompt, it samples N responses, estimates a pseudo ground truth from the ensemble, then trains with RL using rewards derived from matching (or not matching) this estimate. The reward estimation strategy is the core algorithmic innovation — it determines label quality and training signal strength.

The standard approach (Majority Voting) extracts answers from all N responses, counts occurrences, and picks the most common answer. Responses matching this pseudo GT get reward 1, others get 0. However, this approach has limitations:

  • String-based counting may miss semantically equivalent answers
  • Binary rewards provide no gradient signal for "partially correct" responses
  • Equal weighting ignores response quality or confidence signals
  • No mechanism to handle low-confidence cases where the majority may be wrong
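The baseline described above can be sketched in a few lines. This is a self-contained illustration, not the TTRL implementation: `extract_boxed` is a simplified stand-in for the framework's `extract_answer` utility, and plain string counting stands in for semantic grading.

```python
import re
from collections import Counter

def extract_boxed(text: str) -> str:
    """Stand-in for TTRL's extract_answer: pull the last \\boxed{...} span."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else ""

def majority_vote(responses):
    """Baseline majority voting: the most common extracted answer becomes
    the pseudo ground truth.

    Returns (estimated_gt, confidence, None); the None signals the framework
    to fall back to binary match-based rewards.
    """
    answers = [extract_boxed(r) for r in responses]
    counts = Counter(a for a in answers if a)
    if not counts:
        return "", 0.0, None
    estimated_gt, votes = counts.most_common(1)[0]
    confidence = votes / len(responses)
    return estimated_gt, confidence, None
```

Note that string-level counting is exactly where the first limitation bites: "1/2" and "0.5" land in different buckets even though `grade` would treat them as equivalent.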

Evaluation

Your reward estimator is evaluated on 3 benchmarks of increasing difficulty, all training Qwen2.5-1.5B with TTRL (32 votes, 16 training samples, 4 GPUs):

Benchmark   Difficulty                           Dataset                    Episodes   Response Length
AMC         AMC competition math (83 problems)   TTRL built-in (AMC-TTT)    12         2048 tokens
MATH        Competition math (500 problems)      TTRL built-in (MATH-TTT)   5          2048 tokens
AIME        AIME competition (30 problems)       TTRL built-in (AIME-TTT)   25         3072 tokens

Key metrics (from TTRL training logs):

  • label_accuracy: How often the estimated pseudo GT matches the true ground truth
  • reward_accuracy: How often the estimated reward matches what the true GT would give
  • ground_truth_reward: Accuracy when evaluated against actual ground truth (primary metric)
  • pass@k: Whether any of the k training samples is correct

Higher ground_truth_reward means the model is actually learning to solve more problems correctly, regardless of whether the pseudo labels are perfect.

Interface Contract

The training loop calls your function once per prompt:

def compute_ttrl_rewards(
    prompt_responses: List[str],   # N response strings
    prompt_text: str,              # The original question
) -> tuple[str, float, Optional[list[float]]]:
    # Returns: (estimated_gt, confidence, per_sample_rewards)

Return values:

  • estimated_gt: The estimated ground truth answer string (used for pseudo labeling)
  • confidence: A float in [0, 1] indicating confidence in the estimate
  • per_sample_rewards: If None, the framework falls back to binary rewards (1 if a response matches estimated_gt, 0 otherwise). If provided as a list of N floats, these values are used directly as per-response reward signals for training, which is how you can supply soft or graded rewards.
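One way to use the per_sample_rewards slot, sketched under the simplifying assumption that answers have already been extracted and that exact string equality stands in for semantic grading: reward each response by the vote share of its own answer, so minority-but-plausible answers receive a nonzero signal instead of a hard 0.

```python
from collections import Counter

def soft_cluster_rewards(answers):
    """Sketch of a soft reward: each sample is rewarded by the fraction of
    votes its answer received, rather than 0/1 against the single majority.

    `answers` is the list of already-extracted answer strings, one per sample.
    """
    counts = Counter(answers)
    n = len(answers)
    estimated_gt, top = counts.most_common(1)[0]
    confidence = top / n
    rewards = [counts[a] / n for a in answers]
    return estimated_gt, confidence, rewards
```

Whether this graded signal helps depends on how often near-majority clusters contain the true answer; that is precisely what label_accuracy and reward_accuracy let you diagnose.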

Available utilities (already imported):

  • extract_answer(text) — Extract \boxed{} answer from a response string
  • simplify_expression_string(s) — Simplify a math expression string for comparison
  • grade(pred, gt) — Check if two math answers are semantically equivalent
  • TTRLClusterCounter(equiv_fn) — Cluster items by an equivalence function (e.g., grade)
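The exact interface of TTRLClusterCounter is not shown here, so the class below is an illustrative reimplementation of the idea (equivalence-based clustering), not the real TTRL class; a whitespace-insensitive comparison stands in for `grade`.

```python
class EquivClusterCounter:
    """Illustrative: group items under a pairwise equivalence function,
    in the spirit of TTRLClusterCounter(grade). Not the real TTRL class."""

    def __init__(self, equiv_fn):
        self.equiv_fn = equiv_fn
        self.clusters = []  # list of (representative, members)

    def add(self, item):
        # Attach to the first cluster whose representative is equivalent.
        for rep, members in self.clusters:
            if self.equiv_fn(item, rep):
                members.append(item)
                return
        self.clusters.append((item, [item]))

    def most_common(self):
        """Return (representative, count) for the largest cluster."""
        rep, members = max(self.clusters, key=lambda c: len(c[1]))
        return rep, len(members)
```

Clustering with a semantic equivalence function directly addresses the first limitation of majority voting: answers like "1/2" and "1 / 2" collapse into one cluster instead of splitting the vote.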

Code

custom_ttrl_reward.py
"""Custom TTRL reward estimation module.

The agent modifies `compute_ttrl_rewards()` to implement novel
reward estimation strategies for test-time RL without ground truth.
"""
from typing import List, Optional
from collections import Counter

# Utilities from TTRL (NOT editable, just imported)
from verl.utils.reward_score.ttrl_math import extract_answer, simplify_expression_string, grade
from verl.utils.reward_score.ttrl_math.cluster import TTRLClusterCounter


# ── EDITABLE REGION START ──
def compute_ttrl_rewards(

Additional context files (read-only):

  • ttrl/verl/verl/trainer/ppo/ttrl_utils.py
  • ttrl/verl/verl/utils/reward_score/ttrl_math/__init__.py
  • ttrl/verl/verl/utils/reward_score/ttrl_math/cluster.py

Results

No results available yet.