Agent Conversation: deepseek-reasoner
Model: deepseek-reasoner
Total steps: 7
Messages: 16
User · Step 0
Task: llm-dllm-demask-strategy
Masked Diffusion LM: Demasking Strategy
Research Question
Design a better demasking (decoding) strategy for masked diffusion language models. The strategy must generalize across different decoding regimes:
- Block-based semi-autoregressive decoding for downstream task accuracy (LLaDA on MATH/HumanEval, following the KLASS protocol)
- Fully-parallel decoding for open-ended text generation (Dream on prefix-conditioned C4 continuation, measured by perplexity / diversity)
Background
Masked diffusion LMs (LLaDA, Dream) generate by starting from a fully masked
generation region and iteratively unmasking it over a budget of steps denoising iterations.
A demasking strategy decides at each step:
- Schedule: how many tokens to unmask
- Position selection: which masked positions to unmask
- Token assignment: what token id to place
Decoding can be semi-autoregressive (when block_length < gen_length,
process one block at a time) or fully parallel (block_length == gen_length, all positions decoded together).
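The three per-step decisions above can be sketched as a single greedy demask step (a minimal illustration of the idea, not the LLaDA/Dream implementation; `demask_step` and its signature are our own):

```python
import torch
import torch.nn.functional as F

def demask_step(logits, x, mask_id, k):
    """One demasking step: pick the k most confident masked positions
    (schedule: k tokens; position selection: top-1 probability;
    token assignment: argmax) and commit their predicted tokens."""
    probs = F.softmax(logits, dim=-1)
    conf, x0 = probs.max(dim=-1)  # top-1 probability and token id per position
    # Only masked positions compete; unmasked ones get -1 confidence.
    conf = torch.where(x == mask_id, conf, torch.full_like(conf, -1.0))
    _, idx = torch.topk(conf, k, dim=-1)  # k most confident masked slots
    out = x.clone()
    out.scatter_(1, idx, x0.gather(1, idx))
    return out
```

In block-based decoding this step would be restricted to the current block's slice of `x`; in fully parallel decoding it runs over the whole generation region at once.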
What You Can Modify
Edit the DemaskDecoder class in LLaDA/custom_demask_eval.py
(lines 59-151).
Interface
class DemaskDecoder:
def __init__(self, mask_id, temperature=0.0,
conf_threshold=0.9, kl_threshold=0.01, history_length=2):
...
@torch.no_grad()
def decode(self, model, input_ids, gen_length, steps, block_length):
# Returns (x_output [1, prompt_len + gen_length], used_steps)
get_num_transfer_tokens(mask, steps) is available outside the editable
region — returns the uniform schedule (mask.sum() // steps per step).
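A plausible reconstruction of that helper is below (the remainder handling is an assumption; the actual source may distribute leftover tokens differently):

```python
import torch

def get_num_transfer_tokens(mask, steps):
    """Uniform unmasking schedule: roughly mask.sum() // steps tokens per
    step, with any remainder front-loaded onto the earliest steps so the
    per-step counts sum exactly to the masked count."""
    total = mask.sum(dim=1, keepdim=True)  # masked tokens per row
    base = total // steps
    out = base.repeat(1, steps)
    rem = (total % steps).squeeze(1)
    for i, r in enumerate(rem):
        out[i, : int(r)] += 1  # spread the remainder over early steps
    return out
```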
Constraints
- gen_length % block_length == 0. When block_length == gen_length, decoding is fully parallel.
- Process blocks sequentially (no early-decoding into later blocks).
- Always return shape [1, prompt_len + gen_length].
- used_steps counts model forward passes (lower = more efficient).
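A skeleton that honours these constraints might look like the following (`decode_blocks` and `step_fn` are hypothetical names for illustration; `step_fn` stands in for one model forward pass plus a demask step):

```python
import torch

def decode_blocks(step_fn, input_ids, gen_length, steps, block_length, mask_id):
    """Block-sequential decode skeleton: processes blocks left to right,
    never touches later blocks early, counts forward passes in `used`,
    and returns a [1, prompt_len + gen_length] tensor."""
    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length
    steps_per_block = steps // num_blocks
    x = torch.cat([input_ids,
                   torch.full((1, gen_length), mask_id, dtype=torch.long)], dim=1)
    used = 0
    for b in range(num_blocks):
        lo = input_ids.shape[1] + b * block_length
        hi = lo + block_length
        for _ in range(steps_per_block):
            if not (x[:, lo:hi] == mask_id).any():
                break  # block finished early: saved forward passes
            x = step_fn(x, lo, hi)
            used += 1
    return x, used
```

The early `break` is where adaptive strategies recover efficiency: any step whose block is already fully unmasked is skipped, lowering used_steps.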
Evaluation
Benchmarks
| Label | Task | Model | gen_len | steps | block_len | Metrics |
|---|---|---|---|---|---|---|
| llada-math | MATH-500 | LLaDA-8B-Instruct | 256 | 256 | 64 | accuracy + avg_steps |
| llada-humaneval | HumanEval (164) | LLaDA-8B-Instruct | 256 | 256 | 64 | accuracy + avg_steps |
| dream-text | C4 prefix-continuation (256 samples, 32-tok prefix → 224-tok continuation) | Dream-v0-Instruct-7B | 224 | 256 | 224 | gen_ppl + MAUVE + entropy + rep2 + avg_steps |
Metrics
| Metric | Direction | Where | Description |
|---|---|---|---|
| accuracy | ↑ | math/humaneval | Exact match (MATH) or pass@1 (HumanEval) |
| gen_ppl | ↓ | text | Conditional perplexity via GPT-2-Large |
| mauve | ↑ | text | Distributional similarity to C4 reference text |
| entropy | ↑ | text | Bigram entropy (lexical diversity) |
| rep2 | ↓ | text | Repeated-bigram ratio |
| avg_steps | ↓ | all | Actual model forward passes used |
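For intuition, the two lexical-diversity metrics can be sketched as follows (this is our reading of the metric names, not the benchmark's exact implementation):

```python
import math
from collections import Counter

def bigram_stats(tokens):
    """Bigram entropy (higher = more diverse) and repeated-bigram ratio
    rep2 (higher = more repetitive), computed over a token sequence."""
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(bigrams)
    n = len(bigrams)
    # Shannon entropy of the empirical bigram distribution (nats).
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Fraction of bigram occurrences that are not the first of their kind.
    rep2 = 1.0 - len(counts) / n
    return entropy, rep2
```

A degenerate loop like `A B A B A B` scores high rep2 and low entropy, which is exactly the failure mode these metrics are meant to penalise in fully parallel decoding.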
Assistant · Step 1 · edit
Tool Input
1{
2 "op": "replace",
3 "filename": "LLaDA/custom_demask_eval.py",
4 "start_line": 59,
5 "end_line": 151,
6 "content": "class DemaskDecoder:\n \"\"\"Adaptive Confidence-Margin Hybrid with Dynamic Thresholds (ACM-HDT).\n \n Combines confidence and margin (top1 - top2) with adaptive weighting.\n Uses stability (KL divergence) as a filter. Dynamically adjusts thresholds\n based on step progress and score distributions.\n \"\"\"\n\n def __init__(self, mask_id: int, temperature: float = 0.0,\n conf_threshold: float = 0.9, kl_threshold: float = 0.01,\n history_length: int = 2):\n self.mask_id = mask_id\n self.temperature = temperature\n self.conf_threshold = conf_threshold\n self.kl_threshold = kl_threshold\n self.history_length = history_length\n # Hyperparameters for adaptive weighting\n self.margin_weight_start = 1.0 # weight for margin at start\n self.margin_weight_end = 0.2 # weight for margin at end\n self.conf_weight_start = 0.2 # weight for confidence at start\n self.conf_weight_end = 1.0 # weight for confidence at end\n # Dynamic threshold parameters\n self.conf_percentile = 0.7 # target percentile for confidence threshold\n self.margin_percentile = 0.7 # target percentile for margin threshold\n\n @torch.no_grad()\n def decode(self, model, input_ids, gen_length: int, steps: int,\n block_length: int):\n mid = self.mask_id\n x = torch.full((1, input_ids.shape[1] + gen_length), mid,\n dtype=torch.long, device=model.device)\n x[:, :input_ids.shape[1]] = input_ids.clone()\n assert gen_length % block_length == 0\n num_blocks = gen_length // block_length\n assert steps % num_blocks == 0\n steps_per_block = steps // num_blocks\n\n V = model.lm_head.out_features if hasattr(model, \"lm_head\") \\\n else model.config.vocab_size\n kl_hist = torch.zeros((1, x.shape[1], self.history_length),\n dtype=torch.float64, device=x.device)\n p_prev = torch.zeros((1, x.shape[1], V), dtype=torch.float64,\n device=x.device)\n used = 0\n\n for b in range(num_blocks):\n bs = input_ids.shape[1] + b * block_length\n be = bs + block_length\n num_xfer = get_num_transfer_tokens(\n (x[:, bs:be] == 
mid), steps_per_block)\n\n for step in range(steps_per_block):\n mask_idx = (x == mid)\n block_m = torch.zeros_like(mask_idx)\n block_m[:, bs:be] = True\n mask_idx = mask_idx & block_m\n if not mask_idx.any():\n break\n\n logits = model(x).logits\n # Token assignment with temperature\n if self.temperature > 0:\n # Gumbel-softmax sampling\n gumbel_noise = torch.rand_like(logits, dtype=logits.dtype)\n gumbel_noise = -torch.log(-torch.log(gumbel_noise + 1e-10) + 1e-10)\n logits = (logits + gumbel_noise) / self.temperature\n p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n x0 = torch.argmax(p_curr, dim=-1)\n else:\n p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n x0 = torch.argmax(p_curr, dim=-1)\n\n # Confidence and margin\n sorted_probs, _ = torch.sort(p_curr, dim=-1, descending=True)\n conf = sorted_probs[..., 0]\n margin = sorted_probs[..., 0] - sorted_probs[..., 1]\n\n # KL stability\n eps = 1e-12\n kl = (p_curr * (torch.log(p_curr + eps)\n - torch.log(p_prev + eps))).sum(-1)\n kl_hist = torch.roll(kl_hist, -1, dims=-1)\n kl_hist[..., -1] = kl\n p_prev = p_curr.clone()\n\n # Stability mask\n if step >= self.history_length - 1:\n stable = torch.all(kl_hist < self.kl_threshold, dim=-1)\n else:\n stable = torch.zeros_like(conf, dtype=torch.bool)\n\n # Adaptive weighting based on step progress\n step_ratio = (b * steps_per_block + step) / (num_blocks * steps_per_block)\n margin_weight = self.margin_weight_start + (self.margin_weight_end - self.margin_weight_start) * step_ratio\n conf_weight = self.conf_weight_start + (self.conf_weight_end - self.conf_weight_start) * step_ratio\n # Normalize weights\n total = margin_weight + conf_weight\n margin_weight /= total\n conf_weight /= total\n\n # Combined score\n score = margin_weight * margin + conf_weight * conf\n\n # Dynamic thresholds based on percentiles\n if mask_idx.any():\n masked_conf = conf[mask_idx]\n if masked_conf.numel() > 0:\n conf_thresh = torch.quantile(masked_conf, self.conf_percentile)\n 
else:\n conf_thresh = self.conf_threshold\n masked_margin = margin[mask_idx]\n if masked_margin.numel() > 0:\n margin_thresh = torch.quantile(masked_margin, self.margin_percentile)\n else:\n margin_thresh = 0.0\n else:\n conf_thresh = self.conf_threshold\n margin_thresh = 0.0\n\n # Ready positions: stable AND (confidence > conf_thresh OR margin > margin_thresh)\n ready = stable & mask_idx & ((conf > conf_thresh) | (margin > margin_thresh))\n\n xfer = torch.zeros_like(x0, dtype=torch.bool)\n for j in range(ready.shape[0]):\n rdy = torch.where(ready[j])[0]\n k = int(num_xfer[j, step].item())\n if len(rdy) > 0:\n # Select top-k ready positions by score\n rdy_scores = score[j, rdy]\n if len(rdy) <= k:\n xfer[j, rdy] = True\n else:\n _, topk_local = torch.topk(rdy_scores, k)\n xfer[j, rdy[topk_local]] = True\n else:\n # Fallback: top-k by score among masked positions\n s = score[j].clone()\n s[~mask_idx[j]] = -float(\"inf\")\n _, topk = torch.topk(s, k)\n xfer[j, topk] = True\n\n x = torch.where(xfer, x0, x)\n used += 1\n return x, used"
7}
Tool Result · Step 1
OK: Replaced lines 59..151 with 144 line(s) in LLaDA/custom_demask_eval.py. Editable range: 59–202.
[Current file: LLaDA/custom_demask_eval.py | editable: 59–202 | total: 542 lines]
59 class DemaskDecoder:
60 """Adaptive Confidence-Margin Hybrid with Dynamic Thresholds (ACM-HDT).
61
... (138 more lines) ...
200 x = torch.where(xfer, x0, x)
201 used += 1
202 return x, used
Assistant · Step 2 · test
Tool Result · Step 2
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final) ### llada-math (scripts/llada_math.sh) Generation progress (llada-math): TRAIN_METRICS: math 460/500 acc=0.326 avg_steps=256.0 TRAIN_METRICS: math 470/500 acc=0.330 avg_steps=256.0 TRAIN_METRICS: math 480/500 acc=0.331 avg_steps=256.0 TRAIN_METRICS: math 490/500 acc=0.331 avg_steps=256.0 TRAIN_METRICS: math 500/500 acc=0.328 avg_steps=256.0 accuracy_llada-math: 0.3280 avg_steps_llada-math: 256.0000 n_samples_llada-math: 500.0000 ### llada-humaneval (scripts/llada_humaneval.sh) Generation progress (llada-humaneval): TRAIN_METRICS: humaneval 120/164 pass@1=0.483 avg_steps=256.0 TRAIN_METRICS: humaneval 130/164 pass@1=0.469 avg_steps=256.0 TRAIN_METRICS: humaneval 140/164 pass@1=0.436 avg_steps=256.0 TRAIN_METRICS: humaneval 150/164 pass@1=0.427 avg_steps=256.0 TRAIN_METRICS: humaneval 160/164 pass@1=0.419 avg_steps=256.0 accuracy_llada-humaneval: 0.4146 avg_steps_llada-humaneval: 256.0000 n_s
Assistant · Step 3 · edit
Tool Input
1{
2 "op": "replace",
3 "filename": "LLaDA/custom_demask_eval.py",
4 "start_line": 59,
5 "end_line": 202,
6 "content": "class DemaskDecoder:\n \"\"\"Adaptive Confidence-Margin Hybrid with Aggressive Ready Selection (ACM-ARS).\n \n Combines confidence and margin with adaptive weighting.\n Uses stability as a filter. Unmasks all positions that are stable and\n have confidence > conf_threshold OR margin > margin_threshold.\n Falls back to top-k by combined score.\n \"\"\"\n\n def __init__(self, mask_id: int, temperature: float = 0.0,\n conf_threshold: float = 0.9, kl_threshold: float = 0.01,\n history_length: int = 2):\n self.mask_id = mask_id\n self.temperature = temperature\n self.conf_threshold = conf_threshold\n self.kl_threshold = kl_threshold\n self.history_length = history_length\n # Hyperparameters for adaptive weighting\n self.margin_weight_start = 1.0 # weight for margin at start\n self.margin_weight_end = 0.2 # weight for margin at end\n self.conf_weight_start = 0.2 # weight for confidence at start\n self.conf_weight_end = 1.0 # weight for confidence at end\n # Fixed thresholds for ready selection\n self.margin_threshold = 0.3 # margin threshold for ready\n\n @torch.no_grad()\n def decode(self, model, input_ids, gen_length: int, steps: int,\n block_length: int):\n mid = self.mask_id\n x = torch.full((1, input_ids.shape[1] + gen_length), mid,\n dtype=torch.long, device=model.device)\n x[:, :input_ids.shape[1]] = input_ids.clone()\n assert gen_length % block_length == 0\n num_blocks = gen_length // block_length\n assert steps % num_blocks == 0\n steps_per_block = steps // num_blocks\n\n V = model.lm_head.out_features if hasattr(model, \"lm_head\") \\\n else model.config.vocab_size\n kl_hist = torch.zeros((1, x.shape[1], self.history_length),\n dtype=torch.float64, device=x.device)\n p_prev = torch.zeros((1, x.shape[1], V), dtype=torch.float64,\n device=x.device)\n used = 0\n\n for b in range(num_blocks):\n bs = input_ids.shape[1] + b * block_length\n be = bs + block_length\n num_xfer = get_num_transfer_tokens(\n (x[:, bs:be] == mid), steps_per_block)\n\n for step 
in range(steps_per_block):\n mask_idx = (x == mid)\n block_m = torch.zeros_like(mask_idx)\n block_m[:, bs:be] = True\n mask_idx = mask_idx & block_m\n if not mask_idx.any():\n break\n\n logits = model(x).logits\n # Token assignment with temperature\n if self.temperature > 0:\n # Gumbel-softmax sampling\n gumbel_noise = torch.rand_like(logits, dtype=logits.dtype)\n gumbel_noise = -torch.log(-torch.log(gumbel_noise + 1e-10) + 1e-10)\n logits = (logits + gumbel_noise) / self.temperature\n p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n x0 = torch.argmax(p_curr, dim=-1)\n else:\n p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n x0 = torch.argmax(p_curr, dim=-1)\n\n # Confidence and margin\n sorted_probs, _ = torch.sort(p_curr, dim=-1, descending=True)\n conf = sorted_probs[..., 0]\n margin = sorted_probs[..., 0] - sorted_probs[..., 1]\n\n # KL stability\n eps = 1e-12\n kl = (p_curr * (torch.log(p_curr + eps)\n - torch.log(p_prev + eps))).sum(-1)\n kl_hist = torch.roll(kl_hist, -1, dims=-1)\n kl_hist[..., -1] = kl\n p_prev = p_curr.clone()\n\n # Stability mask\n if step >= self.history_length - 1:\n stable = torch.all(kl_hist < self.kl_threshold, dim=-1)\n else:\n stable = torch.zeros_like(conf, dtype=torch.bool)\n\n # Adaptive weighting based on step progress\n step_ratio = (b * steps_per_block + step) / (num_blocks * steps_per_block)\n margin_weight = self.margin_weight_start + (self.margin_weight_end - self.margin_weight_start) * step_ratio\n conf_weight = self.conf_weight_start + (self.conf_weight_end - self.conf_weight_start) * step_ratio\n # Normalize weights\n total = margin_weight + conf_weight\n margin_weight /= total\n conf_weight /= total\n\n # Combined score\n score = margin_weight * margin + conf_weight * conf\n\n # Ready positions: stable AND (confidence > conf_threshold OR margin > margin_threshold)\n ready = stable & mask_idx & ((conf > self.conf_threshold) | (margin > self.margin_threshold))\n\n xfer = torch.zeros_like(x0, 
dtype=torch.bool)\n for j in range(ready.shape[0]):\n rdy = torch.where(ready[j])[0]\n k = int(num_xfer[j, step].item())\n if len(rdy) > 0:\n # Unmask ALL ready positions (aggressive)\n xfer[j, rdy] = True\n else:\n # Fallback: top-k by combined score among masked positions\n s = score[j].clone()\n s[~mask_idx[j]] = -float(\"inf\")\n _, topk = torch.topk(s, k)\n xfer[j, topk] = True\n\n x = torch.where(xfer, x0, x)\n used += 1\n return x, used"
7}
Tool Result · Step 3
OK: Replaced lines 59..202 with 123 line(s) in LLaDA/custom_demask_eval.py. Editable range: 59–181.
[Current file: LLaDA/custom_demask_eval.py | editable: 59–181 | total: 521 lines]
59 class DemaskDecoder:
60 """Adaptive Confidence-Margin Hybrid with Aggressive Ready Selection (ACM-ARS).
61
... (117 more lines) ...
179 x = torch.where(xfer, x0, x)
180 used += 1
181 return x, used
Assistant · Step 4 · test
Tool Result · Step 4
[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final) ### llada-math (scripts/llada_math.sh) Generation progress (llada-math): TRAIN_METRICS: math 460/500 acc=0.285 avg_steps=113.8 TRAIN_METRICS: math 470/500 acc=0.289 avg_steps=113.4 TRAIN_METRICS: math 480/500 acc=0.292 avg_steps=113.6 TRAIN_METRICS: math 490/500 acc=0.292 avg_steps=113.9 TRAIN_METRICS: math 500/500 acc=0.290 avg_steps=114.6 accuracy_llada-math: 0.2900 avg_steps_llada-math: 114.6200 n_samples_llada-math: 500.0000 ### llada-humaneval (scripts/llada_humaneval.sh) Generation progress (llada-humaneval): TRAIN_METRICS: humaneval 120/164 pass@1=0.442 avg_steps=73.8 TRAIN_METRICS: humaneval 130/164 pass@1=0.431 avg_steps=74.1 TRAIN_METRICS: humaneval 140/164 pass@1=0.400 avg_steps=75.1 TRAIN_METRICS: humaneval 150/164 pass@1=0.393 avg_steps=74.7 TRAIN_METRICS: humaneval 160/164 pass@1=0.381 avg_steps=74.6 accuracy_llada-humaneval: 0.3780 avg_steps_llada-humaneval: 74.8000 n_samples_
Assistant · Step 5 · edit
Tool Input
1{
2 "op": "replace",
3 "filename": "LLaDA/custom_demask_eval.py",
4 "start_line": 59,
5 "end_line": 181,
6 "content": "class DemaskDecoder:\n \"\"\"Task-Adaptive Demasking Strategy (TADS).\n \n Adapts behavior based on decoding regime:\n - Semi-autoregressive (multiple blocks): stricter thresholds, emphasis on confidence.\n - Fully parallel (single block): looser thresholds, emphasis on margin for diversity.\n Uses stability filtering and aggressive ready unmasking.\n \"\"\"\n\n def __init__(self, mask_id: int, temperature: float = 0.0,\n conf_threshold: float = 0.9, kl_threshold: float = 0.01,\n history_length: int = 2):\n self.mask_id = mask_id\n self.temperature = temperature\n self.conf_threshold = conf_threshold\n self.kl_threshold = kl_threshold\n self.history_length = history_length\n\n @torch.no_grad()\n def decode(self, model, input_ids, gen_length: int, steps: int,\n block_length: int):\n mid = self.mask_id\n x = torch.full((1, input_ids.shape[1] + gen_length), mid,\n dtype=torch.long, device=model.device)\n x[:, :input_ids.shape[1]] = input_ids.clone()\n assert gen_length % block_length == 0\n num_blocks = gen_length // block_length\n assert steps % num_blocks == 0\n steps_per_block = steps // num_blocks\n\n # Detect decoding regime\n semi_autoregressive = (num_blocks > 1) # block_length < gen_length\n fully_parallel = (num_blocks == 1)\n\n # Adaptive hyperparameters based on regime\n if semi_autoregressive:\n # Stricter thresholds for reasoning tasks\n conf_thresh = 0.95\n margin_thresh = 0.5\n margin_weight_start = 0.3\n margin_weight_end = 0.1\n conf_weight_start = 0.7\n conf_weight_end = 0.9\n ready_condition = 'and' # require both confidence AND margin thresholds\n else:\n # Looser thresholds for open-ended generation\n conf_thresh = 0.7\n margin_thresh = 0.2\n margin_weight_start = 1.0\n margin_weight_end = 0.3\n conf_weight_start = 0.0\n conf_weight_end = 0.7\n ready_condition = 'or' # require confidence OR margin threshold\n\n V = model.lm_head.out_features if hasattr(model, \"lm_head\") \\\n else model.config.vocab_size\n kl_hist = torch.zeros((1, 
x.shape[1], self.history_length),\n dtype=torch.float64, device=x.device)\n p_prev = torch.zeros((1, x.shape[1], V), dtype=torch.float64,\n device=x.device)\n used = 0\n\n for b in range(num_blocks):\n bs = input_ids.shape[1] + b * block_length\n be = bs + block_length\n num_xfer = get_num_transfer_tokens(\n (x[:, bs:be] == mid), steps_per_block)\n\n for step in range(steps_per_block):\n mask_idx = (x == mid)\n block_m = torch.zeros_like(mask_idx)\n block_m[:, bs:be] = True\n mask_idx = mask_idx & block_m\n if not mask_idx.any():\n break\n\n logits = model(x).logits\n # Token assignment with temperature\n if self.temperature > 0:\n # Gumbel-softmax sampling\n gumbel_noise = torch.rand_like(logits, dtype=logits.dtype)\n gumbel_noise = -torch.log(-torch.log(gumbel_noise + 1e-10) + 1e-10)\n logits = (logits + gumbel_noise) / self.temperature\n p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n x0 = torch.argmax(p_curr, dim=-1)\n else:\n p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n x0 = torch.argmax(p_curr, dim=-1)\n\n # Confidence and margin\n sorted_probs, _ = torch.sort(p_curr, dim=-1, descending=True)\n conf = sorted_probs[..., 0]\n margin = sorted_probs[..., 0] - sorted_probs[..., 1]\n\n # KL stability\n eps = 1e-12\n kl = (p_curr * (torch.log(p_curr + eps)\n - torch.log(p_prev + eps))).sum(-1)\n kl_hist = torch.roll(kl_hist, -1, dims=-1)\n kl_hist[..., -1] = kl\n p_prev = p_curr.clone()\n\n # Stability mask\n if step >= self.history_length - 1:\n stable = torch.all(kl_hist < self.kl_threshold, dim=-1)\n else:\n stable = torch.zeros_like(conf, dtype=torch.bool)\n\n # Adaptive weighting based on step progress\n step_ratio = (b * steps_per_block + step) / (num_blocks * steps_per_block)\n margin_weight = margin_weight_start + (margin_weight_end - margin_weight_start) * step_ratio\n conf_weight = conf_weight_start + (conf_weight_end - conf_weight_start) * step_ratio\n # Normalize weights\n total = margin_weight + conf_weight\n if total > 0:\n 
margin_weight /= total\n conf_weight /= total\n else:\n margin_weight = 0.5\n conf_weight = 0.5\n\n # Combined score\n score = margin_weight * margin + conf_weight * conf\n\n # Ready positions\n if ready_condition == 'and':\n ready = stable & mask_idx & (conf > conf_thresh) & (margin > margin_thresh)\n else: # 'or'\n ready = stable & mask_idx & ((conf > conf_thresh) | (margin > margin_thresh))\n\n xfer = torch.zeros_like(x0, dtype=torch.bool)\n for j in range(ready.shape[0]):\n rdy = torch.where(ready[j])[0]\n k = int(num_xfer[j, step].item())\n if len(rdy) > 0:\n # Unmask ALL ready positions (aggressive)\n xfer[j, rdy] = True\n else:\n # Fallback: top-k by combined score among masked positions\n s = score[j].clone()\n s[~mask_idx[j]] = -float(\"inf\")\n _, topk = torch.topk(s, k)\n xfer[j, topk] = True\n\n x = torch.where(xfer, x0, x)\n used += 1\n return x, used"
7}
Tool Result · Step 5
OK: Replaced lines 59..181 with 147 line(s) in LLaDA/custom_demask_eval.py. Editable range: 59–205.
[Current file: LLaDA/custom_demask_eval.py | editable: 59–205 | total: 545 lines]
59 class DemaskDecoder:
60 """Task-Adaptive Demasking Strategy (TADS).
61
... (141 more lines) ...
203 x = torch.where(xfer, x0, x)
204 used += 1
205 return x, used
Assistant · Step 6 · test
Tool Result · Step 6
[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final) [NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer. ### llada-math (scripts/llada_math.sh) Generation progress (llada-math): TRAIN_METRICS: math 460/500 acc=0.309 avg_steps=160.5 TRAIN_METRICS: math 470/500 acc=0.311 avg_steps=159.9 TRAIN_METRICS: math 480/500 acc=0.312 avg_steps=160.2 TRAIN_METRICS: math 490/500 acc=0.312 avg_steps=160.2 TRAIN_METRICS: math 500/500 acc=0.310 avg_steps=160.6 accuracy_llada-math: 0.3100 avg_steps_llada-math: 160.6400 n_samples_llada-math: 500.0000 ### llada-humaneval (scripts/llada_humaneval.sh) Generation progress (llada-humaneval): TRAIN_METRICS: humaneval 120/164 pass@1=0.442 avg_steps=97.3 TRAIN_METRICS: humaneval 130/164 pass@1=0.431 avg_steps=99.7 TRAIN_METRICS: humaneval 140/164 pass@1=0.400 avg_steps=100.0 TRAIN_METRICS: humaneval 150/164 pass@1=0.387 avg_steps=99.4 TRAIN_METRICS:
Assistant · Step 7 · submit
Tool Input
1{
2 "n": 2
3}
Tool Result · Step 7
[submit] Submitting result from test #2 as final. ### llada-math (scripts/llada_math.sh) Generation progress (llada-math): TRAIN_METRICS: math 460/500 acc=0.285 avg_steps=113.8 TRAIN_METRICS: math 470/500 acc=0.289 avg_steps=113.4 TRAIN_METRICS: math 480/500 acc=0.292 avg_steps=113.6 TRAIN_METRICS: math 490/500 acc=0.292 avg_steps=113.9 TRAIN_METRICS: math 500/500 acc=0.290 avg_steps=114.6 accuracy_llada-math: 0.2900 avg_steps_llada-math: 114.6200 n_samples_llada-math: 500.0000 ### llada-humaneval (scripts/llada_humaneval.sh) Generation progress (llada-humaneval): TRAIN_METRICS: humaneval 120/164 pass@1=0.442 avg_steps=73.8 TRAIN_METRICS: humaneval 130/164 pass@1=0.431 avg_steps=74.1 TRAIN_METRICS: humaneval 140/164 pass@1=0.400 avg_steps=75.1 TRAIN_METRICS: humaneval 150/164 pass@1=0.393 avg_steps=74.7 TRAIN_METRICS: humaneval 160/164 pass@1=0.381 avg_steps=74.6 accuracy_llada-humaneval: 0.3780 avg_steps_llada-humaneval: 74.8000 n_samples_llada-humaneval: 164.0000 ### dream-text (s