llm-dllm-demask-strategy

Language Models, LLaDA, rigorous codebase

Description

Masked Diffusion LM: Demasking Strategy

Research Question

Design a better demasking (decoding) strategy for masked diffusion language models. The strategy must generalize across different decoding regimes:

Block-based semi-autoregressive decoding for downstream task accuracy (LLaDA on MATH/HumanEval, following the KLASS protocol)
Fully-parallel decoding for open-ended text generation (Dream on prefix-conditioned C4 continuation, measured by perplexity / diversity)

Background

Masked diffusion LMs (LLaDA, Dream) generate text by starting from a fully masked generation region and iteratively unmasking it over a fixed number of denoising iterations (the steps parameter). A demasking strategy decides at each step:

Schedule: how many tokens to unmask
Position selection: which masked positions to unmask
Token assignment: what token id to place
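The three decisions can be made concrete with a minimal pure-Python sketch of a single step. It assumes the model has already produced a probability distribution for every position. The name demask_step, the MASK value, and the list-based representation are illustrative only and are not part of the codebase. Here the schedule is a fixed k, positions are ranked by their top probability, and tokens are assigned greedily.

```python
MASK = -1  # hypothetical mask id used only in this sketch


def demask_step(tokens, probs, k):
    """Unmask up to k masked positions in `tokens` and return a new list.

    tokens: list of token ids, MASK where still masked
    probs:  probs[i][v] is the predicted probability of token v at position i
    k:      schedule decision, i.e. how many positions to unmask this step
    """
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    # Position selection: rank masked positions by their highest probability.
    ranked = sorted(masked, key=lambda i: max(probs[i]), reverse=True)
    out = list(tokens)
    for i in ranked[:k]:
        # Token assignment: greedy argmax over the vocabulary.
        out[i] = max(range(len(probs[i])), key=lambda v: probs[i][v])
    return out


tokens = [7, MASK, MASK, MASK]
probs = [
    [1.0, 0.0, 0.0],    # already decoded, ignored
    [0.2, 0.7, 0.1],
    [0.05, 0.05, 0.9],
    [0.4, 0.3, 0.3],
]
print(demask_step(tokens, probs, k=2))  # [7, 1, 2, -1]
```

A different strategy would change one or more of the three commented decisions while keeping the same overall shape.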

Decoding can be semi-autoregressive (when block_length < gen_length, process one block at a time) or fully parallel (block_length == gen_length, all positions decoded together).
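As a rough illustration of the two regimes, the sketch below computes the index range of each block inside the full sequence. The helper block_ranges is hypothetical and not part of the evaluation code; it only shows that the fully parallel case is the semi-autoregressive case with a single block.

```python
def block_ranges(prompt_len, gen_length, block_length):
    """Return (start, end) index pairs of each block in the full sequence."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    return [
        (prompt_len + b * block_length, prompt_len + (b + 1) * block_length)
        for b in range(num_blocks)
    ]


# Semi-autoregressive: four blocks of 64 tokens after a 10-token prompt.
print(block_ranges(prompt_len=10, gen_length=256, block_length=64))
# [(10, 74), (74, 138), (138, 202), (202, 266)]

# Fully parallel: one block covering the whole generation region.
print(block_ranges(prompt_len=32, gen_length=224, block_length=224))
# [(32, 256)]
```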

What You Can Modify

Edit the DemaskDecoder class in LLaDA/custom_demask_eval.py (lines 59-151).

Interface

class DemaskDecoder:
    def __init__(self, mask_id, temperature=0.0,
                 conf_threshold=0.9, kl_threshold=0.01, history_length=2):
        ...

    @torch.no_grad()
    def decode(self, model, input_ids, gen_length, steps, block_length):
        # Returns (x_output [1, prompt_len + gen_length], used_steps)
        ...

get_num_transfer_tokens(mask, steps) is available outside the editable region. It returns the uniform schedule, i.e. mask.sum() // steps tokens per step.
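For intuition, a uniform schedule can be sketched in pure Python as below. This is not the helper from the codebase: the name uniform_schedule is hypothetical, it works on a plain count rather than a mask tensor, and the handling of the remainder (handing leftover tokens to the earliest steps) is an assumption made so that the per-step counts sum to the number of masked tokens.

```python
def uniform_schedule(num_masked, steps):
    """Split num_masked tokens over `steps` iterations as evenly as possible."""
    base, remainder = divmod(num_masked, steps)
    # Assumption: leftover tokens go to the earliest steps so the total matches.
    return [base + (1 if s < remainder else 0) for s in range(steps)]


print(uniform_schedule(64, 64)[:4])  # [1, 1, 1, 1]
print(uniform_schedule(10, 4))       # [3, 3, 2, 2]
```

An adaptive strategy would replace this fixed list with counts chosen from the model's predictions at each step.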

Constraints

gen_length % block_length == 0. When equal, decoding is fully parallel.
Process blocks sequentially (no early-decoding into later blocks).
Always return [1, prompt_len + gen_length].
used_steps counts model forward passes (lower = more efficient).
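The constraints above can be read as a loop skeleton. The following pure-Python sketch is one possible shape, not the reference implementation: predict is a stand-in for a model forward pass, the even split of steps across blocks is an assumption, and a flat list replaces the [1, prompt_len + gen_length] tensor. It only unmasks positions inside the current block and increments used_steps once per call to predict.

```python
import math

MASK = -1  # hypothetical mask id for this sketch


def decode_blocks(predict, prompt, gen_length, steps, block_length):
    """Block-sequential decoding skeleton.

    predict(x) stands in for one forward pass and returns, for every
    position, a (token, confidence) pair.
    """
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    x = list(prompt) + [MASK] * gen_length
    num_blocks = gen_length // block_length
    steps_per_block = max(1, steps // num_blocks)  # assumption: even split
    used_steps = 0
    for b in range(num_blocks):
        start = len(prompt) + b * block_length
        end = start + block_length
        for step in range(steps_per_block):
            masked = [i for i in range(start, end) if x[i] == MASK]
            if not masked:
                break  # block finished early, so no further forward passes
            preds = predict(x)
            used_steps += 1
            # Unmask enough tokens to finish within the remaining steps.
            k = math.ceil(len(masked) / (steps_per_block - step))
            best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
            for i in best:
                x[i] = preds[i][0]
    return x, used_steps


def fake_predict(x):
    # Toy stand-in for a model: token i % 5 with confidence 1 / (i + 1).
    return [(i % 5, 1.0 / (i + 1)) for i in range(len(x))]


out, used = decode_blocks(fake_predict, [9, 9], gen_length=8, steps=4, block_length=4)
print(len(out), used)  # 10 4
```

A strategy that finishes a block in fewer passes than its budget lowers used_steps, which is what avg_steps rewards.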

Evaluation

Benchmarks

Label	Task	Model	gen_len	steps	block_len	Metrics
`llada-math`	MATH-500	LLaDA-8B-Instruct	256	256	64	accuracy + avg_steps
`llada-humaneval`	HumanEval (164)	LLaDA-8B-Instruct	256	256	64	accuracy + avg_steps
`dream-text`	C4 prefix-continuation (256 samples, 32-tok prefix → 224-tok continuation)	Dream-v0-Instruct-7B	224	256	224	gen_ppl + MAUVE + entropy + rep2 + avg_steps

Metrics

Metric	Direction	Where	Description
`accuracy`	↑	math/humaneval	exact-match (MATH) or pass@1 (HumanEval)
`gen_ppl`	↓	text	Conditional perplexity via GPT-2-Large
`mauve`	↑	text	Distributional similarity to C4 reference text
`entropy`	↑	text	Bigram entropy (lexical diversity)
`rep2`	↓	text	Repeated bigram ratio
`avg_steps`	↓	all	Actual model forward passes used
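The two diversity metrics can be illustrated with a short sketch. The exact definitions used by the evaluation harness are not given here, so the formulas below are assumptions: rep2 is taken as the fraction of bigrams that duplicate an earlier bigram, and entropy as the Shannon entropy (in bits) of the empirical bigram distribution. The function names are hypothetical.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Adjacent token pairs of a sequence."""
    return list(zip(tokens, tokens[1:]))


def rep2(tokens):
    """Fraction of bigrams that repeat an earlier bigram (assumed definition)."""
    grams = bigrams(tokens)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def bigram_entropy(tokens):
    """Shannon entropy in bits of the bigram distribution (assumed definition)."""
    counts = Counter(bigrams(tokens))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


sample = "a b a b c".split()
print(rep2(sample))            # 0.25
print(bigram_entropy(sample))  # 1.5
```

In the sample, the bigram (a, b) occurs twice among four bigrams, which gives one repeat out of four and a distribution of (1/2, 1/4, 1/4).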

Protocol references

MATH/HumanEval: KLASS (Kim et al., NeurIPS 2025; arXiv 2511.05664). We use KLASS's exact data/math_test.json, prompts, and utils.py for answer extraction (extract_math_answer, compare_answers).
Text generation: prefix-conditioned C4 continuation, similar to MDLM / ReMDM evaluation but with conditioning on a 32-token prefix.

Baselines (from KLASS algorithms)

confidence_greedy — LLaDA's low_confidence remasking: top-k by max prob.
topk_margin — Dream's topk_margin: top-k by (top1 prob − top2 prob).
klass — SOTA: KL-adaptive stability + confidence thresholds.
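The baselines differ mainly in how they score a masked position. The sketch below shows the three scoring ideas on plain probability lists. It is a simplified reading rather than the KLASS code: the function names are hypothetical, the KL direction (newer distribution against older) is an assumption, and the default thresholds are copied from the DemaskDecoder signature.

```python
import math


def confidence(p):
    """Top-1 probability, the score used for confidence-based top-k."""
    return max(p)


def margin(p):
    """Gap between the two largest probabilities."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def is_stable(history, conf_threshold=0.9, kl_threshold=0.01):
    """Decide whether a position is confident and has stopped changing.

    history: recent predicted distributions for one position, oldest first.
    """
    if confidence(history[-1]) < conf_threshold:
        return False
    # Assumption: compare each newer distribution against the one before it.
    return all(
        kl_divergence(new, old) < kl_threshold
        for old, new in zip(history, history[1:])
    )


p = [0.6, 0.3, 0.1]
print(confidence(p))                                       # 0.6
print(is_stable([[0.95, 0.03, 0.02], [0.95, 0.03, 0.02]]))  # True
print(is_stable([[0.5, 0.4, 0.1], [0.95, 0.03, 0.02]]))     # False
```

The first two scores feed a top-k selection with a fixed per-step budget, whereas a threshold test like is_stable lets the number of tokens unmasked per step vary.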

Reference Performance

LLaDA paper (EVAL.md, gen_length=256/steps=256/block_length=256): MATH = 30.3%, HumanEval = 32.9% on LLaDA-8B-Base.

KLASS paper on LLaDA-8B-Instruct, MATH (with block_length=64): ~33.8% (KLASS), reducing steps by 40-70%.

Code

custom_demask_eval.py


"""Downstream task evaluation (MATH, HumanEval) for masked diffusion LMs.

Following the KLASS evaluation protocol (Kim et al., NeurIPS 2025):
  https://github.com/shkim0116/KLASS
"""

from __future__ import annotations

import argparse
import gzip
import json
import os
import re
import sys
import time

Results

Model	Type	accuracy llada-math ↑	avg steps llada-math ↓	n samples llada-math	accuracy llada-humaneval ↑	avg steps llada-humaneval ↓	n samples llada-humaneval	gen ppl dream-text ↓	mauve dream-text ↑	entropy dream-text ↑	rep2 dream-text ↓	avg steps dream-text ↓	n samples dream-text	gen ppl llada-16step ↓	mauve llada-16step ↑	entropy llada-16step ↑	rep2 llada-16step ↓	avg steps llada-16step ↓	gen ppl llada-64step ↓	mauve llada-64step ↑	entropy llada-64step ↑	rep2 llada-64step ↓	avg steps llada-64step ↓	gen ppl dream-16step ↓	mauve dream-16step ↑	entropy dream-16step ↑	rep2 dream-16step ↓	avg steps dream-16step ↓	gen ppl dream-8step ↓	mauve dream-8step ↑	entropy dream-8step ↑	rep2 dream-8step ↓	avg steps dream-8step ↓	gen ppl dream-64step ↓	mauve dream-64step ↑	entropy dream-64step ↑	rep2 dream-64step ↓	avg steps dream-64step ↓	gen ppl dream-128step ↓	mauve dream-128step ↑	entropy dream-128step ↑	rep2 dream-128step ↓	avg steps dream-128step ↓	gen ppl llada-256step ↓	mauve llada-256step ↑	entropy llada-256step ↑	rep2 llada-256step ↓	avg steps llada-256step ↓	accuracy dream-humaneval ↑	avg steps dream-humaneval ↓	n samples dream-humaneval	accuracy dream-math ↑	avg steps dream-math ↓	n samples dream-math
confidence_greedy	baseline	0.316	256.000	500.000	0.366	256.000	164.000	170.609	0.032	6.413	0.013	224.000	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
confidence_greedy	baseline	-	-	-	-	-	-	-	-	-	-	-	-	9999.000	0.031	4.769	0.612	16.000	9999.000	0.048	9.424	0.648	64.000	669.218	0.030	7.836	0.039	16.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
confidence_greedy	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	669.218	0.030	7.836	0.039	16.000	383.115	0.023	7.797	0.095	8.000	108.939	0.141	5.421	0.002	64.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
confidence_greedy	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	669.218	0.030	7.836	0.039	16.000	-	-	-	-	-	-	-	-	-	-	136.184	0.056	5.630	0.015	128.000	9999.000	0.097	12.220	0.658	224.000	-	-	-	-	-	-
confidence_greedy	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	0.000	256.000	164.000	-	-	-
klass	baseline	0.334	127.860	500.000	0.372	93.810	164.000	64.219	0.068	6.324	0.016	88.540	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
klass	baseline	0.334	127.860	500.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	0.004	44.860	500.000
klass_kl	baseline	-	-	-	-	-	-	-	-	-	-	-	-	9999.000	0.037	3.770	0.551	15.210	9999.000	0.024	4.368	0.621	51.270	299.423	0.029	6.371	0.053	15.880	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
klass_kl	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	299.423	0.029	6.371	0.053	15.880	138.932	0.021	6.483	0.098	8.000	74.680	0.047	4.416	0.015	51.260	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
klass_kl	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	299.423	0.029	6.371	0.053	15.880	-	-	-	-	-	-	-	-	-	-	127.860	0.060	5.298	0.014	80.810	9999.000	0.121	11.267	0.565	113.000	-	-	-	-	-	-
klass_kl	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	0.000	129.000	164.000	-	-	-
prophet	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	671.774	0.014	7.910	0.044	11.300	403.231	0.023	7.811	0.089	5.610	170.392	0.023	6.556	0.014	48.430	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
prophet	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	671.774	0.014	7.910	0.044	11.300	-	-	-	-	-	-	-	-	-	-	182.622	0.018	6.379	0.025	96.800	9999.000	0.103	12.225	0.657	181.640	-	-	-	-	-	-
prophet	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	0.000	208.460	164.000	-	-	-
random	baseline	-	-	-	-	-	-	-	-	-	-	-	-	9999.000	0.046	4.289	0.649	16.000	9999.000	0.080	6.495	0.576	64.000	9999.000	0.012	6.610	0.255	16.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
random	baseline	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	9999.000	0.012	6.610	0.255	16.000	9999.000	0.011	6.580	0.252	8.000	9999.000	0.010	6.101	0.288	64.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
topk_margin	baseline	0.322	256.000	500.000	0.390	256.000	164.000	237.050	0.112	5.926	0.025	224.000	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
anthropic/claude-opus-4.6	vanilla	0.284	57.430	500.000	0.378	56.260	164.000	39.504	0.085	6.026	0.034	40.820	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
deepseek-reasoner	vanilla	0.328	256.000	500.000	0.415	256.000	164.000	221.498	0.055	5.276	0.015	224.000	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
google/gemini-3.1-pro-preview	vanilla	0.318	118.310	500.000	0.402	91.450	164.000	12.428	0.094	4.328	0.080	49.090	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
openai/gpt-5.4	vanilla	0.308	83.040	500.000	0.378	71.680	164.000	34.420	0.093	6.232	0.004	35.320	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
qwen/qwen3.6-plus	vanilla	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
anthropic/claude-opus-4.6	agent	0.304	121.730	500.000	0.402	89.390	164.000	28.442	0.230	6.134	0.013	35.230	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
deepseek-reasoner	agent	0.290	114.620	500.000	0.378	74.800	164.000	26.749	0.210	6.031	0.007	26.720	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
google/gemini-3.1-pro-preview	agent	0.318	118.310	500.000	0.402	91.450	164.000	12.428	0.094	4.328	0.080	49.090	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
openai/gpt-5.4	agent	0.336	149.510	500.000	0.378	137.630	164.000	27.182	0.102	6.216	0.004	31.980	256.000	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
qwen/qwen3.6-plus	agent	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
