Agent Conversation: deepseek-reasoner

Model: deepseek-reasoner
Total steps: 7
Messages: 16
User Step 0

Task: llm-dllm-demask-strategy

Masked Diffusion LM: Demasking Strategy

Research Question

Design a better demasking (decoding) strategy for masked diffusion language models. The strategy must generalize across different decoding regimes:

  • Block-based semi-autoregressive decoding for downstream task accuracy (LLaDA on MATH/HumanEval, following the KLASS protocol)
  • Fully-parallel decoding for open-ended text generation (Dream on prefix-conditioned C4 continuation, measured by perplexity / diversity)

Background

Masked diffusion LMs (LLaDA, Dream) generate by starting from a fully masked generation region and iteratively unmasking it over a budget of denoising iterations (the steps argument). A demasking strategy decides at each step:

  1. Schedule: how many tokens to unmask
  2. Position selection: which masked positions to unmask
  3. Token assignment: what token id to place

Decoding can be semi-autoregressive (when block_length < gen_length, process one block at a time) or fully parallel (block_length == gen_length, all positions decoded together).
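One step of a simple confidence-based strategy makes all three decisions at once: the schedule is the caller-supplied k, position selection is top-k by confidence, and token assignment is the argmax. A minimal sketch (demask_step and its shapes are illustrative, not the repository's code):

```python
import torch
import torch.nn.functional as F

def demask_step(logits, x, mask_id, k):
    # One illustrative demasking step: commit the argmax token at the k most
    # confident masked positions; every other position stays as it is.
    probs = F.softmax(logits, dim=-1)            # [1, L, V] token distribution
    conf, x0 = probs.max(dim=-1)                 # [1, L] top-1 prob and token id
    conf = conf.masked_fill(x != mask_id, -1.0)  # only masked positions compete
    _, idx = torch.topk(conf, k, dim=-1)         # position selection
    x = x.clone()
    x[0, idx[0]] = x0[0, idx[0]]                 # token assignment
    return x
```

In the semi-autoregressive regime the same step runs with the competition restricted to the current block; fully parallel decoding is the special case of a single block spanning the whole generation region.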

What You Can Modify

Edit the DemaskDecoder class in LLaDA/custom_demask_eval.py (lines 59-151).

Interface

class DemaskDecoder:
    def __init__(self, mask_id, temperature=0.0,
                 conf_threshold=0.9, kl_threshold=0.01, history_length=2):
        ...

    @torch.no_grad()
    def decode(self, model, input_ids, gen_length, steps, block_length):
        # Returns (x_output [1, prompt_len + gen_length], used_steps)

get_num_transfer_tokens(mask, steps) is available outside the editable region — returns the uniform schedule (mask.sum() // steps per step).
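Its behavior can be approximated as follows. This is a sketch of the assumed semantics (hence the `_sketch` suffix), not the repository's implementation; the real helper may distribute the remainder differently:

```python
import torch

def get_num_transfer_tokens_sketch(mask, steps):
    # Split the masked-position count evenly over `steps`, front-loading any
    # remainder so every masked token is scheduled exactly once.
    total = mask.sum(dim=-1, keepdim=True)  # [B, 1] masked count per row
    base = total // steps                   # floor share per step
    out = base.repeat(1, steps)             # [B, steps] uniform schedule
    rem = (total % steps).squeeze(-1)       # leftover tokens per row
    for b in range(mask.shape[0]):
        out[b, : rem[b]] += 1               # give the remainder to early steps
    return out
```

Row b, column s then holds how many tokens to unmask at step s, and each row sums to that row's masked-token count.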

Constraints

  • gen_length % block_length == 0. When equal, decoding is fully parallel.
  • Process blocks sequentially (no early-decoding into later blocks).
  • Always return [1, prompt_len + gen_length].
  • used_steps counts model forward passes (lower = more efficient).

Evaluation

Benchmarks

Label            Task                                                                        Model                 gen_len  steps  block_len  Metrics
llada-math       MATH-500                                                                    LLaDA-8B-Instruct     256      256    64         accuracy + avg_steps
llada-humaneval  HumanEval (164)                                                             LLaDA-8B-Instruct     256      256    64         accuracy + avg_steps
dream-text       C4 prefix-continuation (256 samples, 32-tok prefix → 224-tok continuation)  Dream-v0-Instruct-7B  224      256    224        gen_ppl + MAUVE + entropy + rep2 + avg_steps

Metrics

Metric     Direction      Where           Description
accuracy   higher better  math/humaneval  exact-match (MATH) or pass@1 (HumanEval)
gen_ppl    lower better   text            Conditional perplexity via GPT-2-Large
mauve      higher better  text            Distributional similarity to C4 reference text
entropy    higher better  text            Bigram entropy (lexical diversity)
rep2       lower better   text            Repeated bigram ratio
avg_steps  lower better   all             Actual model forward passes used
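The two lexical-diversity metrics are cheap to compute from a token sequence. A sketch under assumed definitions (rep2 here is the fraction of bigrams that repeat an earlier bigram, which may differ in detail from the eval script's exact formula):

```python
from collections import Counter
import math

def bigram_stats(tokens):
    # Bigram entropy of the empirical bigram distribution (nats), and the
    # repeated-bigram ratio: 1 - (distinct bigrams / total bigrams).
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(bigrams)
    n = len(bigrams)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    rep2 = 1.0 - len(counts) / n
    return entropy, rep2
```

Higher entropy and lower rep2 both indicate less repetitive text, which is why an overly greedy demasking strategy that copies high-confidence n-grams tends to hurt both.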

Assistant Step 1 (edit)
Tool Input
{
  "op": "replace",
  "filename": "LLaDA/custom_demask_eval.py",
  "start_line": 59,
  "end_line": 151,
6  "content": "class DemaskDecoder:\n    \"\"\"Adaptive Confidence-Margin Hybrid with Dynamic Thresholds (ACM-HDT).\n    \n    Combines confidence and margin (top1 - top2) with adaptive weighting.\n    Uses stability (KL divergence) as a filter. Dynamically adjusts thresholds\n    based on step progress and score distributions.\n    \"\"\"\n\n    def __init__(self, mask_id: int, temperature: float = 0.0,\n                 conf_threshold: float = 0.9, kl_threshold: float = 0.01,\n                 history_length: int = 2):\n        self.mask_id = mask_id\n        self.temperature = temperature\n        self.conf_threshold = conf_threshold\n        self.kl_threshold = kl_threshold\n        self.history_length = history_length\n        # Hyperparameters for adaptive weighting\n        self.margin_weight_start = 1.0   # weight for margin at start\n        self.margin_weight_end = 0.2     # weight for margin at end\n        self.conf_weight_start = 0.2     # weight for confidence at start\n        self.conf_weight_end = 1.0       # weight for confidence at end\n        # Dynamic threshold parameters\n        self.conf_percentile = 0.7       # target percentile for confidence threshold\n        self.margin_percentile = 0.7     # target percentile for margin threshold\n\n    @torch.no_grad()\n    def decode(self, model, input_ids, gen_length: int, steps: int,\n               block_length: int):\n        mid = self.mask_id\n        x = torch.full((1, input_ids.shape[1] + gen_length), mid,\n                       dtype=torch.long, device=model.device)\n        x[:, :input_ids.shape[1]] = input_ids.clone()\n        assert gen_length % block_length == 0\n        num_blocks = gen_length // block_length\n        assert steps % num_blocks == 0\n        steps_per_block = steps // num_blocks\n\n        V = model.lm_head.out_features if hasattr(model, \"lm_head\") \\\n                                       else model.config.vocab_size\n        kl_hist = torch.zeros((1, x.shape[1], 
self.history_length),\n                              dtype=torch.float64, device=x.device)\n        p_prev = torch.zeros((1, x.shape[1], V), dtype=torch.float64,\n                             device=x.device)\n        used = 0\n\n        for b in range(num_blocks):\n            bs = input_ids.shape[1] + b * block_length\n            be = bs + block_length\n            num_xfer = get_num_transfer_tokens(\n                (x[:, bs:be] == mid), steps_per_block)\n\n            for step in range(steps_per_block):\n                mask_idx = (x == mid)\n                block_m = torch.zeros_like(mask_idx)\n                block_m[:, bs:be] = True\n                mask_idx = mask_idx & block_m\n                if not mask_idx.any():\n                    break\n\n                logits = model(x).logits\n                # Token assignment with temperature\n                if self.temperature > 0:\n                    # Gumbel-softmax sampling\n                    gumbel_noise = torch.rand_like(logits, dtype=logits.dtype)\n                    gumbel_noise = -torch.log(-torch.log(gumbel_noise + 1e-10) + 1e-10)\n                    logits = (logits + gumbel_noise) / self.temperature\n                    p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n                    x0 = torch.argmax(p_curr, dim=-1)\n                else:\n                    p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n                    x0 = torch.argmax(p_curr, dim=-1)\n\n                # Confidence and margin\n                sorted_probs, _ = torch.sort(p_curr, dim=-1, descending=True)\n                conf = sorted_probs[..., 0]\n                margin = sorted_probs[..., 0] - sorted_probs[..., 1]\n\n                # KL stability\n                eps = 1e-12\n                kl = (p_curr * (torch.log(p_curr + eps)\n                                - torch.log(p_prev + eps))).sum(-1)\n                kl_hist = torch.roll(kl_hist, -1, dims=-1)\n                kl_hist[..., -1] = kl\n  
              p_prev = p_curr.clone()\n\n                # Stability mask\n                if step >= self.history_length - 1:\n                    stable = torch.all(kl_hist < self.kl_threshold, dim=-1)\n                else:\n                    stable = torch.zeros_like(conf, dtype=torch.bool)\n\n                # Adaptive weighting based on step progress\n                step_ratio = (b * steps_per_block + step) / (num_blocks * steps_per_block)\n                margin_weight = self.margin_weight_start + (self.margin_weight_end - self.margin_weight_start) * step_ratio\n                conf_weight = self.conf_weight_start + (self.conf_weight_end - self.conf_weight_start) * step_ratio\n                # Normalize weights\n                total = margin_weight + conf_weight\n                margin_weight /= total\n                conf_weight /= total\n\n                # Combined score\n                score = margin_weight * margin + conf_weight * conf\n\n                # Dynamic thresholds based on percentiles\n                if mask_idx.any():\n                    masked_conf = conf[mask_idx]\n                    if masked_conf.numel() > 0:\n                        conf_thresh = torch.quantile(masked_conf, self.conf_percentile)\n                    else:\n                        conf_thresh = self.conf_threshold\n                    masked_margin = margin[mask_idx]\n                    if masked_margin.numel() > 0:\n                        margin_thresh = torch.quantile(masked_margin, self.margin_percentile)\n                    else:\n                        margin_thresh = 0.0\n                else:\n                    conf_thresh = self.conf_threshold\n                    margin_thresh = 0.0\n\n                # Ready positions: stable AND (confidence > conf_thresh OR margin > margin_thresh)\n                ready = stable & mask_idx & ((conf > conf_thresh) | (margin > margin_thresh))\n\n                xfer = torch.zeros_like(x0, dtype=torch.bool)\n       
         for j in range(ready.shape[0]):\n                    rdy = torch.where(ready[j])[0]\n                    k = int(num_xfer[j, step].item())\n                    if len(rdy) > 0:\n                        # Select top-k ready positions by score\n                        rdy_scores = score[j, rdy]\n                        if len(rdy) <= k:\n                            xfer[j, rdy] = True\n                        else:\n                            _, topk_local = torch.topk(rdy_scores, k)\n                            xfer[j, rdy[topk_local]] = True\n                    else:\n                        # Fallback: top-k by score among masked positions\n                        s = score[j].clone()\n                        s[~mask_idx[j]] = -float(\"inf\")\n                        _, topk = torch.topk(s, k)\n                        xfer[j, topk] = True\n\n                x = torch.where(xfer, x0, x)\n                used += 1\n        return x, used"
}
Tool Result Step 1
OK: Replaced lines 59..151 with 144 line(s) in LLaDA/custom_demask_eval.py. Editable range: 59–202.

[Current file: LLaDA/custom_demask_eval.py | editable: 59–202 | total: 542 lines]
    59  class DemaskDecoder:
    60      """Adaptive Confidence-Margin Hybrid with Dynamic Thresholds (ACM-HDT).
    61      
       ... (138 more lines) ...
   200                  x = torch.where(xfer, x0, x)
   201                  used += 1
   202          return x, used
Assistant Step 2 (test)
Tool Result Step 2
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### llada-math (scripts/llada_math.sh)
Generation progress (llada-math):
TRAIN_METRICS: math 460/500 acc=0.326 avg_steps=256.0
TRAIN_METRICS: math 470/500 acc=0.330 avg_steps=256.0
TRAIN_METRICS: math 480/500 acc=0.331 avg_steps=256.0
TRAIN_METRICS: math 490/500 acc=0.331 avg_steps=256.0
TRAIN_METRICS: math 500/500 acc=0.328 avg_steps=256.0
accuracy_llada-math: 0.3280
avg_steps_llada-math: 256.0000
n_samples_llada-math: 500.0000

### llada-humaneval (scripts/llada_humaneval.sh)
Generation progress (llada-humaneval):
TRAIN_METRICS: humaneval 120/164 pass@1=0.483 avg_steps=256.0
TRAIN_METRICS: humaneval 130/164 pass@1=0.469 avg_steps=256.0
TRAIN_METRICS: humaneval 140/164 pass@1=0.436 avg_steps=256.0
TRAIN_METRICS: humaneval 150/164 pass@1=0.427 avg_steps=256.0
TRAIN_METRICS: humaneval 160/164 pass@1=0.419 avg_steps=256.0
accuracy_llada-humaneval: 0.4146
avg_steps_llada-humaneval: 256.0000
n_s
Assistant Step 3 (edit)
Tool Input
{
  "op": "replace",
  "filename": "LLaDA/custom_demask_eval.py",
  "start_line": 59,
  "end_line": 202,
6  "content": "class DemaskDecoder:\n    \"\"\"Adaptive Confidence-Margin Hybrid with Aggressive Ready Selection (ACM-ARS).\n    \n    Combines confidence and margin with adaptive weighting.\n    Uses stability as a filter. Unmasks all positions that are stable and\n    have confidence > conf_threshold OR margin > margin_threshold.\n    Falls back to top-k by combined score.\n    \"\"\"\n\n    def __init__(self, mask_id: int, temperature: float = 0.0,\n                 conf_threshold: float = 0.9, kl_threshold: float = 0.01,\n                 history_length: int = 2):\n        self.mask_id = mask_id\n        self.temperature = temperature\n        self.conf_threshold = conf_threshold\n        self.kl_threshold = kl_threshold\n        self.history_length = history_length\n        # Hyperparameters for adaptive weighting\n        self.margin_weight_start = 1.0   # weight for margin at start\n        self.margin_weight_end = 0.2     # weight for margin at end\n        self.conf_weight_start = 0.2     # weight for confidence at start\n        self.conf_weight_end = 1.0       # weight for confidence at end\n        # Fixed thresholds for ready selection\n        self.margin_threshold = 0.3      # margin threshold for ready\n\n    @torch.no_grad()\n    def decode(self, model, input_ids, gen_length: int, steps: int,\n               block_length: int):\n        mid = self.mask_id\n        x = torch.full((1, input_ids.shape[1] + gen_length), mid,\n                       dtype=torch.long, device=model.device)\n        x[:, :input_ids.shape[1]] = input_ids.clone()\n        assert gen_length % block_length == 0\n        num_blocks = gen_length // block_length\n        assert steps % num_blocks == 0\n        steps_per_block = steps // num_blocks\n\n        V = model.lm_head.out_features if hasattr(model, \"lm_head\") \\\n                                       else model.config.vocab_size\n        kl_hist = torch.zeros((1, x.shape[1], self.history_length),\n                      
        dtype=torch.float64, device=x.device)\n        p_prev = torch.zeros((1, x.shape[1], V), dtype=torch.float64,\n                             device=x.device)\n        used = 0\n\n        for b in range(num_blocks):\n            bs = input_ids.shape[1] + b * block_length\n            be = bs + block_length\n            num_xfer = get_num_transfer_tokens(\n                (x[:, bs:be] == mid), steps_per_block)\n\n            for step in range(steps_per_block):\n                mask_idx = (x == mid)\n                block_m = torch.zeros_like(mask_idx)\n                block_m[:, bs:be] = True\n                mask_idx = mask_idx & block_m\n                if not mask_idx.any():\n                    break\n\n                logits = model(x).logits\n                # Token assignment with temperature\n                if self.temperature > 0:\n                    # Gumbel-softmax sampling\n                    gumbel_noise = torch.rand_like(logits, dtype=logits.dtype)\n                    gumbel_noise = -torch.log(-torch.log(gumbel_noise + 1e-10) + 1e-10)\n                    logits = (logits + gumbel_noise) / self.temperature\n                    p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n                    x0 = torch.argmax(p_curr, dim=-1)\n                else:\n                    p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n                    x0 = torch.argmax(p_curr, dim=-1)\n\n                # Confidence and margin\n                sorted_probs, _ = torch.sort(p_curr, dim=-1, descending=True)\n                conf = sorted_probs[..., 0]\n                margin = sorted_probs[..., 0] - sorted_probs[..., 1]\n\n                # KL stability\n                eps = 1e-12\n                kl = (p_curr * (torch.log(p_curr + eps)\n                                - torch.log(p_prev + eps))).sum(-1)\n                kl_hist = torch.roll(kl_hist, -1, dims=-1)\n                kl_hist[..., -1] = kl\n                p_prev = p_curr.clone()\n\n    
            # Stability mask\n                if step >= self.history_length - 1:\n                    stable = torch.all(kl_hist < self.kl_threshold, dim=-1)\n                else:\n                    stable = torch.zeros_like(conf, dtype=torch.bool)\n\n                # Adaptive weighting based on step progress\n                step_ratio = (b * steps_per_block + step) / (num_blocks * steps_per_block)\n                margin_weight = self.margin_weight_start + (self.margin_weight_end - self.margin_weight_start) * step_ratio\n                conf_weight = self.conf_weight_start + (self.conf_weight_end - self.conf_weight_start) * step_ratio\n                # Normalize weights\n                total = margin_weight + conf_weight\n                margin_weight /= total\n                conf_weight /= total\n\n                # Combined score\n                score = margin_weight * margin + conf_weight * conf\n\n                # Ready positions: stable AND (confidence > conf_threshold OR margin > margin_threshold)\n                ready = stable & mask_idx & ((conf > self.conf_threshold) | (margin > self.margin_threshold))\n\n                xfer = torch.zeros_like(x0, dtype=torch.bool)\n                for j in range(ready.shape[0]):\n                    rdy = torch.where(ready[j])[0]\n                    k = int(num_xfer[j, step].item())\n                    if len(rdy) > 0:\n                        # Unmask ALL ready positions (aggressive)\n                        xfer[j, rdy] = True\n                    else:\n                        # Fallback: top-k by combined score among masked positions\n                        s = score[j].clone()\n                        s[~mask_idx[j]] = -float(\"inf\")\n                        _, topk = torch.topk(s, k)\n                        xfer[j, topk] = True\n\n                x = torch.where(xfer, x0, x)\n                used += 1\n        return x, used"
}
Tool Result Step 3
OK: Replaced lines 59..202 with 123 line(s) in LLaDA/custom_demask_eval.py. Editable range: 59–181.

[Current file: LLaDA/custom_demask_eval.py | editable: 59–181 | total: 521 lines]
    59  class DemaskDecoder:
    60      """Adaptive Confidence-Margin Hybrid with Aggressive Ready Selection (ACM-ARS).
    61      
       ... (117 more lines) ...
   179                  x = torch.where(xfer, x0, x)
   180                  used += 1
   181          return x, used
Assistant Step 4 (test)
Tool Result Step 4
[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### llada-math (scripts/llada_math.sh)
Generation progress (llada-math):
TRAIN_METRICS: math 460/500 acc=0.285 avg_steps=113.8
TRAIN_METRICS: math 470/500 acc=0.289 avg_steps=113.4
TRAIN_METRICS: math 480/500 acc=0.292 avg_steps=113.6
TRAIN_METRICS: math 490/500 acc=0.292 avg_steps=113.9
TRAIN_METRICS: math 500/500 acc=0.290 avg_steps=114.6
accuracy_llada-math: 0.2900
avg_steps_llada-math: 114.6200
n_samples_llada-math: 500.0000

### llada-humaneval (scripts/llada_humaneval.sh)
Generation progress (llada-humaneval):
TRAIN_METRICS: humaneval 120/164 pass@1=0.442 avg_steps=73.8
TRAIN_METRICS: humaneval 130/164 pass@1=0.431 avg_steps=74.1
TRAIN_METRICS: humaneval 140/164 pass@1=0.400 avg_steps=75.1
TRAIN_METRICS: humaneval 150/164 pass@1=0.393 avg_steps=74.7
TRAIN_METRICS: humaneval 160/164 pass@1=0.381 avg_steps=74.6
accuracy_llada-humaneval: 0.3780
avg_steps_llada-humaneval: 74.8000
n_samples_
Assistant Step 5 (edit)
Tool Input
{
  "op": "replace",
  "filename": "LLaDA/custom_demask_eval.py",
  "start_line": 59,
  "end_line": 181,
6  "content": "class DemaskDecoder:\n    \"\"\"Task-Adaptive Demasking Strategy (TADS).\n    \n    Adapts behavior based on decoding regime:\n    - Semi-autoregressive (multiple blocks): stricter thresholds, emphasis on confidence.\n    - Fully parallel (single block): looser thresholds, emphasis on margin for diversity.\n    Uses stability filtering and aggressive ready unmasking.\n    \"\"\"\n\n    def __init__(self, mask_id: int, temperature: float = 0.0,\n                 conf_threshold: float = 0.9, kl_threshold: float = 0.01,\n                 history_length: int = 2):\n        self.mask_id = mask_id\n        self.temperature = temperature\n        self.conf_threshold = conf_threshold\n        self.kl_threshold = kl_threshold\n        self.history_length = history_length\n\n    @torch.no_grad()\n    def decode(self, model, input_ids, gen_length: int, steps: int,\n               block_length: int):\n        mid = self.mask_id\n        x = torch.full((1, input_ids.shape[1] + gen_length), mid,\n                       dtype=torch.long, device=model.device)\n        x[:, :input_ids.shape[1]] = input_ids.clone()\n        assert gen_length % block_length == 0\n        num_blocks = gen_length // block_length\n        assert steps % num_blocks == 0\n        steps_per_block = steps // num_blocks\n\n        # Detect decoding regime\n        semi_autoregressive = (num_blocks > 1)  # block_length < gen_length\n        fully_parallel = (num_blocks == 1)\n\n        # Adaptive hyperparameters based on regime\n        if semi_autoregressive:\n            # Stricter thresholds for reasoning tasks\n            conf_thresh = 0.95\n            margin_thresh = 0.5\n            margin_weight_start = 0.3\n            margin_weight_end = 0.1\n            conf_weight_start = 0.7\n            conf_weight_end = 0.9\n            ready_condition = 'and'  # require both confidence AND margin thresholds\n        else:\n            # Looser thresholds for open-ended generation\n            
conf_thresh = 0.7\n            margin_thresh = 0.2\n            margin_weight_start = 1.0\n            margin_weight_end = 0.3\n            conf_weight_start = 0.0\n            conf_weight_end = 0.7\n            ready_condition = 'or'   # require confidence OR margin threshold\n\n        V = model.lm_head.out_features if hasattr(model, \"lm_head\") \\\n                                       else model.config.vocab_size\n        kl_hist = torch.zeros((1, x.shape[1], self.history_length),\n                              dtype=torch.float64, device=x.device)\n        p_prev = torch.zeros((1, x.shape[1], V), dtype=torch.float64,\n                             device=x.device)\n        used = 0\n\n        for b in range(num_blocks):\n            bs = input_ids.shape[1] + b * block_length\n            be = bs + block_length\n            num_xfer = get_num_transfer_tokens(\n                (x[:, bs:be] == mid), steps_per_block)\n\n            for step in range(steps_per_block):\n                mask_idx = (x == mid)\n                block_m = torch.zeros_like(mask_idx)\n                block_m[:, bs:be] = True\n                mask_idx = mask_idx & block_m\n                if not mask_idx.any():\n                    break\n\n                logits = model(x).logits\n                # Token assignment with temperature\n                if self.temperature > 0:\n                    # Gumbel-softmax sampling\n                    gumbel_noise = torch.rand_like(logits, dtype=logits.dtype)\n                    gumbel_noise = -torch.log(-torch.log(gumbel_noise + 1e-10) + 1e-10)\n                    logits = (logits + gumbel_noise) / self.temperature\n                    p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n                    x0 = torch.argmax(p_curr, dim=-1)\n                else:\n                    p_curr = F.softmax(logits.to(torch.float64), dim=-1)\n                    x0 = torch.argmax(p_curr, dim=-1)\n\n                # Confidence and margin\n              
  sorted_probs, _ = torch.sort(p_curr, dim=-1, descending=True)\n                conf = sorted_probs[..., 0]\n                margin = sorted_probs[..., 0] - sorted_probs[..., 1]\n\n                # KL stability\n                eps = 1e-12\n                kl = (p_curr * (torch.log(p_curr + eps)\n                                - torch.log(p_prev + eps))).sum(-1)\n                kl_hist = torch.roll(kl_hist, -1, dims=-1)\n                kl_hist[..., -1] = kl\n                p_prev = p_curr.clone()\n\n                # Stability mask\n                if step >= self.history_length - 1:\n                    stable = torch.all(kl_hist < self.kl_threshold, dim=-1)\n                else:\n                    stable = torch.zeros_like(conf, dtype=torch.bool)\n\n                # Adaptive weighting based on step progress\n                step_ratio = (b * steps_per_block + step) / (num_blocks * steps_per_block)\n                margin_weight = margin_weight_start + (margin_weight_end - margin_weight_start) * step_ratio\n                conf_weight = conf_weight_start + (conf_weight_end - conf_weight_start) * step_ratio\n                # Normalize weights\n                total = margin_weight + conf_weight\n                if total > 0:\n                    margin_weight /= total\n                    conf_weight /= total\n                else:\n                    margin_weight = 0.5\n                    conf_weight = 0.5\n\n                # Combined score\n                score = margin_weight * margin + conf_weight * conf\n\n                # Ready positions\n                if ready_condition == 'and':\n                    ready = stable & mask_idx & (conf > conf_thresh) & (margin > margin_thresh)\n                else:  # 'or'\n                    ready = stable & mask_idx & ((conf > conf_thresh) | (margin > margin_thresh))\n\n                xfer = torch.zeros_like(x0, dtype=torch.bool)\n                for j in range(ready.shape[0]):\n                    rdy 
= torch.where(ready[j])[0]\n                    k = int(num_xfer[j, step].item())\n                    if len(rdy) > 0:\n                        # Unmask ALL ready positions (aggressive)\n                        xfer[j, rdy] = True\n                    else:\n                        # Fallback: top-k by combined score among masked positions\n                        s = score[j].clone()\n                        s[~mask_idx[j]] = -float(\"inf\")\n                        _, topk = torch.topk(s, k)\n                        xfer[j, topk] = True\n\n                x = torch.where(xfer, x0, x)\n                used += 1\n        return x, used"
}
Tool Result Step 5
OK: Replaced lines 59..181 with 147 line(s) in LLaDA/custom_demask_eval.py. Editable range: 59–205.

[Current file: LLaDA/custom_demask_eval.py | editable: 59–205 | total: 545 lines]
    59  class DemaskDecoder:
    60      """Task-Adaptive Demasking Strategy (TADS).
    61      
       ... (141 more lines) ...
   203                  x = torch.where(xfer, x0, x)
   204                  used += 1
   205          return x, used
Assistant Step 6 (test)
Tool Result Step 6
[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)

[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.

### llada-math (scripts/llada_math.sh)
Generation progress (llada-math):
TRAIN_METRICS: math 460/500 acc=0.309 avg_steps=160.5
TRAIN_METRICS: math 470/500 acc=0.311 avg_steps=159.9
TRAIN_METRICS: math 480/500 acc=0.312 avg_steps=160.2
TRAIN_METRICS: math 490/500 acc=0.312 avg_steps=160.2
TRAIN_METRICS: math 500/500 acc=0.310 avg_steps=160.6
accuracy_llada-math: 0.3100
avg_steps_llada-math: 160.6400
n_samples_llada-math: 500.0000

### llada-humaneval (scripts/llada_humaneval.sh)
Generation progress (llada-humaneval):
TRAIN_METRICS: humaneval 120/164 pass@1=0.442 avg_steps=97.3
TRAIN_METRICS: humaneval 130/164 pass@1=0.431 avg_steps=99.7
TRAIN_METRICS: humaneval 140/164 pass@1=0.400 avg_steps=100.0
TRAIN_METRICS: humaneval 150/164 pass@1=0.387 avg_steps=99.4
TRAIN_METRICS:
Assistant Step 7 (submit)
Tool Input
{
  "n": 2
}
Tool Result Step 7
[submit] Submitting result from test #2 as final.

### llada-math (scripts/llada_math.sh)
Generation progress (llada-math):
TRAIN_METRICS: math 460/500 acc=0.285 avg_steps=113.8
TRAIN_METRICS: math 470/500 acc=0.289 avg_steps=113.4
TRAIN_METRICS: math 480/500 acc=0.292 avg_steps=113.6
TRAIN_METRICS: math 490/500 acc=0.292 avg_steps=113.9
TRAIN_METRICS: math 500/500 acc=0.290 avg_steps=114.6
accuracy_llada-math: 0.2900
avg_steps_llada-math: 114.6200
n_samples_llada-math: 500.0000

### llada-humaneval (scripts/llada_humaneval.sh)
Generation progress (llada-humaneval):
TRAIN_METRICS: humaneval 120/164 pass@1=0.442 avg_steps=73.8
TRAIN_METRICS: humaneval 130/164 pass@1=0.431 avg_steps=74.1
TRAIN_METRICS: humaneval 140/164 pass@1=0.400 avg_steps=75.1
TRAIN_METRICS: humaneval 150/164 pass@1=0.393 avg_steps=74.7
TRAIN_METRICS: humaneval 160/164 pass@1=0.381 avg_steps=74.6
accuracy_llada-humaneval: 0.3780
avg_steps_llada-humaneval: 74.8000
n_samples_llada-humaneval: 164.0000

### dream-text (s