dlm-dkv-policy

Deep Learning · dLLM-cache · rigorous codebase

Description

Diffusion LM Cache: Refresh Policy

Research Question

Design a better denoising-step refresh policy for diffusion language models. Given a fixed host model (LLaDA-8B-Instruct), fixed step budget, and fixed evaluation harness, can you decide which token states to refresh and how often to refresh them so that output quality stays high while compute and memory stay low?

This task isolates one scientific question: state reuse versus refresh scheduling inside the denoising trajectory. You may not change the scheduler, model architecture, serving stack, or decoding mode.

Evaluation Setup

The harness runs real LLaDA-8B-Instruct inference end-to-end using the dLLM-cache library. For each workload and regime the harness:

  1. Loads prompts from checked-in trace files (real public benchmark prompts from MMLU-Pro, GSM8K, MBPP, and NuminaMath-CoT).
  2. Runs a reference pass with gen_interval=1 (no caching) to establish ground-truth token sequences.
  3. Runs the policy pass with dLLM-cache hooks active, controlled by the editable DLMRefreshPolicy class.
  4. Reports token-level exact-match quality against the reference, plus efficiency metrics.

What You Can Modify

You may edit only the DLMRefreshPolicy class in dLLM-cache/custom_dlm_eval.py (lines 51–100 in the harness file).

The editable methods are:

  • refresh_mask(step_id, token_stats, budget_state) → list[bool]
  • prompt_refresh_interval(step_id, request_meta) → int
  • gen_refresh_interval(step_id, request_meta) → int
  • transfer_ratio(step_id, request_meta, token_stats) → float
  • fallback_action(step_id, quality_proxy) → str

Allowed fallback_action outputs: hold, refresh_all.
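To make the shape of the editable class concrete, here is a minimal fixed-cadence sketch. The method names and signatures come from the list above; the bodies are illustrative placeholders, not the harness defaults or a recommended policy:

```python
class DLMRefreshPolicy:
    """Illustrative fixed-cadence sketch of the editable policy class.

    Method names/signatures follow the task description; the bodies
    are placeholder logic, not the harness's actual implementation.
    """

    def refresh_mask(self, step_id, token_stats, budget_state):
        # Refresh every token on a fixed cadence, hold otherwise.
        do_refresh = (step_id % 4 == 0)
        return [do_refresh for _ in token_stats]

    def prompt_refresh_interval(self, step_id, request_meta):
        return 25  # prompt KV states are stable; refresh rarely

    def gen_refresh_interval(self, step_id, request_meta):
        return 4   # generation states drift faster; refresh more often

    def transfer_ratio(self, step_id, request_meta, token_stats):
        return 0.25  # partially transfer a quarter of gen tokens

    def fallback_action(self, step_id, quality_proxy):
        # Only the strings "hold" and "refresh_all" are allowed.
        return "refresh_all" if quality_proxy < 0.5 else "hold"
```

The fixed cadence here is essentially the fixed_interval baseline described later; adaptive policies replace refresh_mask with token-level logic.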

Token stats available in token_stats list (each element is a dict):

Field        Type        Meaning
importance   float 0–1   per-token confidence score from model logits
staleness    float 0–1   relative step progress (0 = early, 1 = late)
difficulty   float 0–1   normalized token-distribution entropy
similarity   float 0–1   cosine similarity to previous-step distribution

Budget state fields in budget_state:

Field              Type    Meaning
budget_scale       float   1.0 = full, 0.70 = medium, 0.48 = tight
scarcity_pressure  float   1 − budget_scale
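As a sketch of how these fields might be combined, the following hypothetical refresh rule flags tokens whose cached states look stale and tightens the cutoff as the budget shrinks. The weights and threshold are illustrative, not tuned values from the task:

```python
def adaptive_refresh_mask(step_id, token_stats, budget_state):
    """Hypothetical refresh rule: refresh tokens that look unstable.

    A token is refreshed when its drift score exceeds a threshold that
    rises with scarcity_pressure, so tighter budgets refresh fewer
    tokens. Weights are illustrative placeholders.
    """
    threshold = 0.5 + 0.3 * budget_state["scarcity_pressure"]
    mask = []
    for stats in token_stats:
        # Low similarity to the previous step and high entropy both
        # suggest the cached state is stale; late-step tokens get a
        # small discount since their distributions change less.
        drift = (1.0 - stats["similarity"]) * 0.6 \
            + stats["difficulty"] * 0.3 \
            + (1.0 - stats["staleness"]) * 0.1
        mask.append(drift > threshold)
    return mask
```

The design choice to raise the threshold under scarcity trades a little quality for reuse exactly where the tight_steps regime rewards it.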

What You Cannot Modify

  • the dLLM-cache host library
  • the visible workload presets and regimes
  • metric definitions and the parser
  • the reference generation pass

Workloads and Regimes

Visible workload families:

Workload                    Benchmark Source  Character
general_instruction         MMLU-Pro          multi-choice instruction following
math_reasoning              GSM8K             step-by-step arithmetic reasoning
code_generation             MBPP              Python function synthesis
reasoning_refresh_scarcity  NuminaMath-CoT    long-horizon math with shifting token importance

Visible step regimes:

Regime        budget_scale  Character
full_steps    1.00          unconstrained
medium_steps  0.70          moderate cache pressure
tight_steps   0.48          high cache pressure

Visible test scripts run:

  • instruction-medium: general_instruction × medium_steps
  • math-tight: math_reasoning × tight_steps
  • code-tight: code_generation × tight_steps
  • reasoning-scarcity-tight: reasoning_refresh_scarcity × tight_steps

Metrics

The harness prints a TEST_METRICS: line with:

Metric                    Direction  Meaning
quality_main              ↑ higher   token-level exact match vs uncached reference (%)
reuse_ratio               ↑ higher   fraction of gen steps where KV is reused
refresh_ratio             ↓ lower    1 − reuse_ratio
quality_efficiency_score  ↑ higher   quality_main × reuse_ratio (primary rank)
tokens_per_s              ↑ higher   generation throughput
peak_memory_mb            ↓ lower    peak GPU memory usage
n_prompts                 —          number of prompts evaluated
eval_mode                 —          always real_rollout

Composite Ranking Metric

quality_efficiency_score = quality_main × reuse_ratio

This is the primary ranking signal. It rewards policies that simultaneously achieve high denoising quality and high KV-state reuse:

  • A policy that refreshes everything every step: reuse_ratio ≈ 0 → low score.
  • A policy that almost never refreshes: quality_main degrades → low score.
  • The optimal policy: near-reference quality at high reuse → high score.

Baselines

Baseline               Family                               Source               Status
fixed_interval         fixed cadence                        task-native control  anchor (weakest)
d2cache                confidence + difficulty              D²Cache-inspired     representative
dllm_cache_similarity  similarity-guided                    dLLM-Cache-inspired  SOTA
freecache              stable-state reuse                   FreeCache-inspired   representative
dkv_cache              importance + staleness threshold     dKV-Cache threshold  representative
dkv_cache_greedy       importance + staleness top-fraction  dKV-Cache greedy     representative

SOTA anchor: dllm_cache_similarity achieves the highest average quality_efficiency_score across the four visible workloads by using similarity-guided refresh decisions that balance quality preservation with high KV-state reuse. A new policy is considered an improvement when it beats dllm_cache_similarity on quality_efficiency_score while not regressing quality_main below the next-best baseline on any workload.

fixed_interval is the simplest baseline (fixed cadence, no token-level adaptation). It serves as a lower-bound anchor: its refresh cadence is too rigid to achieve competitive quality under tight budgets, so it scores lowest on quality_efficiency_score overall. The adaptive baselines (d2cache, freecache, dkv_cache, dkv_cache_greedy) use token-level statistics to decide when to refresh, achieving higher quality and reuse. Designing a policy that outperforms the best adaptive methods across all workloads is the key challenge.
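The improvement criterion can be written as a predicate. This is a sketch: the per-workload result dicts and their key names are assumed for illustration, not part of the harness:

```python
def beats_sota(candidate, sota, next_best):
    """Check the improvement criterion from the task description.

    Each argument maps workload name -> dict with the keys
    'quality_efficiency_score' and 'quality_main' (assumed shape).
    The candidate must beat the SOTA baseline on average
    quality_efficiency_score while never dropping quality_main below
    the next-best baseline on any workload.
    """
    workloads = list(candidate)

    def avg_qes(results):
        return sum(results[w]["quality_efficiency_score"] for w in workloads) / len(workloads)

    if avg_qes(candidate) <= avg_qes(sota):
        return False
    return all(
        candidate[w]["quality_main"] >= next_best[w]["quality_main"]
        for w in workloads
    )
```

Note the asymmetry: the score comparison is an average across workloads, but the quality floor is enforced per workload.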

Notes

  • The editable region is lines 51–100 in dLLM-cache/custom_dlm_eval.py.
  • mid_edit.py creates the harness file; baseline *.edit.py files apply the policy class replacement.
  • reasoning_refresh_scarcity is the key diversity workload: it features a longer reasoning trajectory where important token groups shift across steps, making it a harder policy design problem.
  • dkv_cache_greedy is the maximum-reuse / minimum-quality corner of the visible baseline Pareto front; it is not a target to beat on quality.
  • Token stats (importance, staleness, difficulty, similarity) are computed from real model logits during each denoising step.
  • The transfer_ratio controls what fraction of generation tokens undergo partial key/value transfer even when not fully refreshed.
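One way to drive transfer_ratio from the per-step token stats, as an untuned sketch (the base value and slope are invented for illustration): transfer more aggressively on steps whose tokens look difficult, so partial key/value updates compensate for skipped full refreshes.

```python
def pressure_aware_transfer_ratio(step_id, request_meta, token_stats):
    """Untuned sketch: scale the partial-transfer fraction with the
    mean token difficulty of the current step, clamped to [0, 1]."""
    if not token_stats:
        return 0.0
    mean_difficulty = sum(t["difficulty"] for t in token_stats) / len(token_stats)
    # Base 0.2 partial transfer, growing toward 0.6 on hard steps.
    return min(1.0, 0.2 + 0.4 * mean_difficulty)
```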

Code

custom_dlm_eval.py
"""Real LLaDA rollout harness for dlm-dkv-policy."""

from __future__ import annotations

import argparse
import json
import os
import sys
import time
from pathlib import Path

import torch

# dLLM-cache path resolution
_HERE = Path(__file__).resolve().parent

Results

No results yet.