llm-dllm-demask-strategy
Description
Masked Diffusion LM: Demasking Strategy
Research Question
Design a better demasking (decoding) strategy for masked diffusion language models. The strategy must generalize across different decoding regimes:
- Block-based semi-autoregressive decoding for downstream task accuracy (LLaDA on MATH/HumanEval, following the KLASS protocol)
- Fully-parallel decoding for open-ended text generation (Dream on prefix-conditioned C4 continuation, measured by perplexity / diversity)
Background
Masked diffusion LMs (LLaDA, Dream) generate by starting from a fully masked
generation region and iteratively unmasking it over `steps` denoising iterations.
A demasking strategy decides at each step:
- Schedule: how many tokens to unmask
- Position selection: which masked positions to unmask
- Token assignment: what token id to place
Decoding can be semi-autoregressive (when `block_length < gen_length`, blocks are
processed one at a time) or fully parallel (`block_length == gen_length`, all positions decoded together).
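One step of this loop combines all three decisions. The following is a minimal sketch of a single demasking step, not the repository's implementation; the `mask_id` sentinel and the greedy argmax token assignment are illustrative assumptions:

```python
import torch

def demask_step(logits, x, mask_id, k):
    """One generic demasking step: choose k masked positions and fill them.

    logits: [1, seq_len, vocab] model output; x: [1, seq_len] current token ids.
    """
    probs = torch.softmax(logits, dim=-1)
    conf, token_ids = probs.max(dim=-1)              # token assignment: argmax id per position
    conf = conf.masked_fill(x != mask_id, -1.0)      # only masked positions are candidates
    _, positions = conf.topk(k, dim=-1)              # position selection: k most confident
    x = x.clone()
    x[0, positions[0]] = token_ids[0, positions[0]]  # unmask the chosen positions
    return x
```

A schedule then just decides the value of `k` at each step; the strategies below differ in how `conf` is scored and how `k` adapts over steps.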
What You Can Modify
Edit the `DemaskDecoder` class in `LLaDA/custom_demask_eval.py`
(lines 59-151).
Interface
```python
class DemaskDecoder:
    def __init__(self, mask_id, temperature=0.0,
                 conf_threshold=0.9, kl_threshold=0.01, history_length=2):
        ...

    @torch.no_grad()
    def decode(self, model, input_ids, gen_length, steps, block_length):
        # Returns (x_output [1, prompt_len + gen_length], used_steps)
        ...
```
`get_num_transfer_tokens(mask, steps)` is available outside the editable
region; it returns the uniform schedule (`mask.sum() // steps` tokens per step).
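A plausible reconstruction of that helper's behavior (the real function lives outside the editable region; the front-loaded remainder handling here is an assumption based on LLaDA's reference code):

```python
import torch

def get_num_transfer_tokens_sketch(mask, steps):
    """Uniform unmasking schedule: mask.sum() // steps tokens per step,
    with the remainder spread over the first few steps."""
    mask_num = mask.sum(dim=1, keepdim=True)   # [B, 1] masked tokens per sample
    base = mask_num // steps
    remainder = mask_num % steps
    schedule = base.repeat(1, steps)           # [B, steps]
    for i in range(mask.shape[0]):
        schedule[i, : remainder[i]] += 1       # front-load the leftover tokens
    return schedule
```

Whatever the exact remainder placement, the rows always sum to the number of masked tokens, so a strategy that consumes this schedule finishes in exactly `steps` forward passes.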
Constraints
- `gen_length % block_length == 0`. When they are equal, decoding is fully parallel.
- Process blocks sequentially (no early decoding into later blocks).
- Always return a tensor of shape `[1, prompt_len + gen_length]`. `used_steps` counts model forward passes (lower = more efficient).
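Under these constraints, a minimal block-wise confidence-greedy `decode` might look like the sketch below. It is not the repository's implementation; the HF-style `model(x).logits` interface is an assumption:

```python
import torch

@torch.no_grad()
def decode_sketch(model, input_ids, mask_id, gen_length, steps, block_length):
    assert gen_length % block_length == 0
    prompt_len = input_ids.shape[1]
    x = torch.full((1, prompt_len + gen_length), mask_id, dtype=torch.long)
    x[:, :prompt_len] = input_ids
    num_blocks = gen_length // block_length
    steps_per_block = steps // num_blocks
    used_steps = 0
    for b in range(num_blocks):
        start = prompt_len + b * block_length
        end = start + block_length
        k = block_length // steps_per_block or 1     # tokens to unmask per step
        while (x[:, start:end] == mask_id).any():
            logits = model(x).logits                 # one forward pass
            used_steps += 1
            probs = torch.softmax(logits, dim=-1)
            conf, ids = probs.max(dim=-1)
            conf[:, :start] = -1.0                   # never touch earlier content
            conf[:, end:] = -1.0                     # never decode into later blocks
            conf = conf.masked_fill(x != mask_id, -1.0)
            n = min(k, int((x[:, start:end] == mask_id).sum()))
            _, pos = conf[0].topk(n)
            x[0, pos] = ids[0, pos]
    return x, used_steps
```

The two hard constraints show up directly: `conf` is clamped outside the current block, and the returned tensor is always `[1, prompt_len + gen_length]` with `used_steps` counting forward passes.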
Evaluation
Benchmarks
| Label | Task | Model | gen_len | steps | block_len | Metrics |
|---|---|---|---|---|---|---|
| llada-math | MATH-500 | LLaDA-8B-Instruct | 256 | 256 | 64 | accuracy + avg_steps |
| llada-humaneval | HumanEval (164) | LLaDA-8B-Instruct | 256 | 256 | 64 | accuracy + avg_steps |
| dream-text | C4 prefix-continuation (256 samples, 32-tok prefix → 224-tok continuation) | Dream-v0-Instruct-7B | 224 | 256 | 224 | gen_ppl + MAUVE + entropy + rep2 + avg_steps |
Metrics
| Metric | Direction | Where | Description |
|---|---|---|---|
| accuracy | ↑ | math/humaneval | exact-match (MATH) or pass@1 (HumanEval) |
| gen_ppl | ↓ | text | Conditional perplexity via GPT-2-Large |
| mauve | ↑ | text | Distributional similarity to C4 reference text |
| entropy | ↑ | text | Bigram entropy (lexical diversity) |
| rep2 | ↓ | text | Repeated bigram ratio |
| avg_steps | ↓ | all | Actual model forward passes used |
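The two diversity metrics are cheap to compute from token ids alone. A sketch of bigram entropy and the repeated-bigram ratio, assuming the standard n-gram-entropy and rep-n formulations (the evaluation harness may differ in detail):

```python
import math
from collections import Counter

def bigram_entropy(tokens):
    """Shannon entropy (bits) of the empirical bigram distribution."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    return -sum((c / total) * math.log2(c / total) for c in bigrams.values())

def rep2(tokens):
    """Fraction of repeated bigrams: 1 - unique / total."""
    bigrams = list(zip(tokens, tokens[1:]))
    return 1.0 - len(set(bigrams)) / len(bigrams)
```

Degenerate parallel decoding shows up clearly in these two numbers: a looping continuation drives `rep2` toward 1 and entropy toward 0 even when `gen_ppl` looks acceptable.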
Protocol references
- MATH/HumanEval: KLASS (Kim et al., NeurIPS 2025; arXiv 2511.05664). We use KLASS's exact `data/math_test.json`, prompts, and `utils.py` for answer extraction (`extract_math_answer`, `compare_answers`).
- Text generation: prefix-conditioned C4 continuation, similar to the MDLM / ReMDM evaluation but conditioned on a 32-token prefix.
Baselines (from KLASS algorithms)
- `confidence_greedy`: LLaDA's `low_confidence` remasking, top-k by max probability.
- `topk_margin`: Dream's `topk_margin`, top-k by (top-1 prob − top-2 prob).
- `klass`: SOTA, KL-adaptive stability + confidence thresholds.
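The two simpler baselines differ only in how they score masked positions before the top-k selection; a sketch (function names are hypothetical):

```python
import torch

def confidence_score(probs):
    """low_confidence-style score: max probability per position."""
    return probs.max(dim=-1).values

def topk_margin_score(probs):
    """topk_margin-style score: gap between top-1 and top-2 probabilities."""
    top2 = probs.topk(2, dim=-1).values
    return top2[..., 0] - top2[..., 1]
```

The margin score favors positions where the model has one clear winner, which can differ from raw confidence when probability mass is split between two plausible tokens.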
Reference Performance
LLaDA paper (EVAL.md, gen_length=256/steps=256/block_length=256): MATH = 30.3%, HumanEval = 32.9% on LLaDA-8B-Base.
KLASS paper on LLaDA-8B-Instruct, MATH (with block_length=64): ~33.8% (KLASS), reducing steps by 40-70%.
Code
1"""Downstream task evaluation (MATH, HumanEval) for masked diffusion LMs.23Following the KLASS evaluation protocol (Kim et al., NeurIPS 2025):4https://github.com/shkim0116/KLASS5"""67from __future__ import annotations89import argparse10import gzip11import json12import os13import re14import sys15import time
Results
| Model | Type | accuracy llada-math ↑ | avg steps llada-math ↓ | n samples llada-math | accuracy llada-humaneval ↑ | avg steps llada-humaneval ↓ | n samples llada-humaneval | gen ppl dream-text ↓ | mauve dream-text ↑ | entropy dream-text ↑ | rep2 dream-text ↓ | avg steps dream-text ↓ | n samples dream-text | gen ppl llada-16step ↓ | mauve llada-16step ↑ | entropy llada-16step ↑ | rep2 llada-16step ↓ | avg steps llada-16step ↓ | gen ppl llada-64step ↓ | mauve llada-64step ↑ | entropy llada-64step ↑ | rep2 llada-64step ↓ | avg steps llada-64step ↓ | gen ppl dream-16step ↓ | mauve dream-16step ↑ | entropy dream-16step ↑ | rep2 dream-16step ↓ | avg steps dream-16step ↓ | gen ppl dream-8step ↓ | mauve dream-8step ↑ | entropy dream-8step ↑ | rep2 dream-8step ↓ | avg steps dream-8step ↓ | gen ppl dream-64step ↓ | mauve dream-64step ↑ | entropy dream-64step ↑ | rep2 dream-64step ↓ | avg steps dream-64step ↓ | gen ppl dream-128step ↓ | mauve dream-128step ↑ | entropy dream-128step ↑ | rep2 dream-128step ↓ | avg steps dream-128step ↓ | gen ppl llada-256step ↓ | mauve llada-256step ↑ | entropy llada-256step ↑ | rep2 llada-256step ↓ | avg steps llada-256step ↓ | accuracy dream-humaneval ↑ | avg steps dream-humaneval ↓ | n samples dream-humaneval | accuracy dream-math ↑ | avg steps dream-math ↓ | n samples dream-math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| confidence_greedy | baseline | 0.316 | 256.000 | 500.000 | 0.366 | 256.000 | 164.000 | 170.609 | 0.032 | 6.413 | 0.013 | 224.000 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| confidence_greedy | baseline | - | - | - | - | - | - | - | - | - | - | - | - | 9999.000 | 0.031 | 4.769 | 0.612 | 16.000 | 9999.000 | 0.048 | 9.424 | 0.648 | 64.000 | 669.218 | 0.030 | 7.836 | 0.039 | 16.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| confidence_greedy | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 669.218 | 0.030 | 7.836 | 0.039 | 16.000 | 383.115 | 0.023 | 7.797 | 0.095 | 8.000 | 108.939 | 0.141 | 5.421 | 0.002 | 64.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| confidence_greedy | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 669.218 | 0.030 | 7.836 | 0.039 | 16.000 | - | - | - | - | - | - | - | - | - | - | 136.184 | 0.056 | 5.630 | 0.015 | 128.000 | 9999.000 | 0.097 | 12.220 | 0.658 | 224.000 | - | - | - | - | - | - |
| confidence_greedy | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0.000 | 256.000 | 164.000 | - | - | - |
| confidence_greedy | baseline | - | - | - | - | - | - | 170.609 | 0.032 | 6.413 | 0.013 | 224.000 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| klass | baseline | 0.334 | 127.860 | 500.000 | 0.372 | 93.810 | 164.000 | 64.219 | 0.068 | 6.324 | 0.016 | 88.540 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| klass | baseline | 0.334 | 127.860 | 500.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0.004 | 44.860 | 500.000 |
| klass_kl | baseline | - | - | - | - | - | - | - | - | - | - | - | - | 9999.000 | 0.037 | 3.770 | 0.551 | 15.210 | 9999.000 | 0.024 | 4.368 | 0.621 | 51.270 | 299.423 | 0.029 | 6.371 | 0.053 | 15.880 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| klass_kl | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 299.423 | 0.029 | 6.371 | 0.053 | 15.880 | 138.932 | 0.021 | 6.483 | 0.098 | 8.000 | 74.680 | 0.047 | 4.416 | 0.015 | 51.260 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| klass_kl | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 299.423 | 0.029 | 6.371 | 0.053 | 15.880 | - | - | - | - | - | - | - | - | - | - | 127.860 | 0.060 | 5.298 | 0.014 | 80.810 | 9999.000 | 0.121 | 11.267 | 0.565 | 113.000 | - | - | - | - | - | - |
| klass_kl | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0.000 | 129.000 | 164.000 | - | - | - |
| prophet | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 671.774 | 0.014 | 7.910 | 0.044 | 11.300 | 403.231 | 0.023 | 7.811 | 0.089 | 5.610 | 170.392 | 0.023 | 6.556 | 0.014 | 48.430 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| prophet | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 671.774 | 0.014 | 7.910 | 0.044 | 11.300 | - | - | - | - | - | - | - | - | - | - | 182.622 | 0.018 | 6.379 | 0.025 | 96.800 | 9999.000 | 0.103 | 12.225 | 0.657 | 181.640 | - | - | - | - | - | - |
| prophet | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0.000 | 208.460 | 164.000 | - | - | - |
| random | baseline | - | - | - | - | - | - | - | - | - | - | - | - | 9999.000 | 0.046 | 4.289 | 0.649 | 16.000 | 9999.000 | 0.080 | 6.495 | 0.576 | 64.000 | 9999.000 | 0.012 | 6.610 | 0.255 | 16.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| random | baseline | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 9999.000 | 0.012 | 6.610 | 0.255 | 16.000 | 9999.000 | 0.011 | 6.580 | 0.252 | 8.000 | 9999.000 | 0.010 | 6.101 | 0.288 | 64.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| topk_margin | baseline | 0.322 | 256.000 | 500.000 | 0.390 | 256.000 | 164.000 | 237.050 | 0.112 | 5.926 | 0.025 | 224.000 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| topk_margin | baseline | - | - | - | - | - | - | 237.050 | 0.112 | 5.926 | 0.025 | 224.000 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| topk_margin | baseline | - | - | - | 0.390 | 256.000 | 164.000 | 237.050 | 0.112 | 5.926 | 0.025 | 224.000 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| anthropic/claude-opus-4.6 | vanilla | 0.284 | 57.430 | 500.000 | 0.378 | 56.260 | 164.000 | 39.504 | 0.085 | 6.026 | 0.034 | 40.820 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| deepseek-reasoner | vanilla | 0.328 | 256.000 | 500.000 | 0.415 | 256.000 | 164.000 | 221.498 | 0.055 | 5.276 | 0.015 | 224.000 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| google/gemini-3.1-pro-preview | vanilla | 0.318 | 118.310 | 500.000 | 0.402 | 91.450 | 164.000 | 12.428 | 0.094 | 4.328 | 0.080 | 49.090 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| openai/gpt-5.4 | vanilla | 0.308 | 83.040 | 500.000 | 0.378 | 71.680 | 164.000 | 34.420 | 0.093 | 6.232 | 0.004 | 35.320 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| qwen/qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| anthropic/claude-opus-4.6 | agent | 0.304 | 121.730 | 500.000 | 0.402 | 89.390 | 164.000 | 28.442 | 0.230 | 6.134 | 0.013 | 35.230 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| deepseek-reasoner | agent | 0.290 | 114.620 | 500.000 | 0.378 | 74.800 | 164.000 | 26.749 | 0.210 | 6.031 | 0.007 | 26.720 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| google/gemini-3.1-pro-preview | agent | 0.318 | 118.310 | 500.000 | 0.402 | 91.450 | 164.000 | 12.428 | 0.094 | 4.328 | 0.080 | 49.090 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| openai/gpt-5.4 | agent | 0.336 | 149.510 | 500.000 | 0.378 | 137.630 | 164.000 | 27.182 | 0.102 | 6.216 | 0.004 | 31.980 | 256.000 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| qwen/qwen3.6-plus | agent | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |