llm-kv-adaptive-quantization
Description
LLM KV Cache: Adaptive Quantization Policy
Research Question
Design an adaptive KV-cache quantization policy for decoder-only LLM inference on top of a tensor-level Transformers replay harness. The primary research focus is bit allocation and residual-preservation under a fixed replay contract. Your policy should reduce KV-cache memory while preserving output quality across frozen benchmark slices.
What You Can Modify
The editable region is the AdaptiveQuantPolicy class in custom_quant_eval.py. The core policy problem is:
- choosing bit-widths for keys and values
- choosing the quantization axis
- choosing the residual full-precision window
- selecting a calibration mode for the current workload
The policy must keep these core methods:
- choose_bits(layer_id, kv_kind, head_group, token_stats, budget_state) -> int
- choose_axis(layer_id, kv_kind) -> str
- residual_length(layer_id, request_meta) -> int
- calibration_mode(workload_meta, budget_state) -> str
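The required surface can be sketched as a minimal policy class. The method signatures come from the contract above; the concrete heuristics (early layers and keys get more bits, a fixed 32-token residual window, a "static" calibration mode) are illustrative assumptions, not harness defaults:

```python
# Minimal sketch of a policy implementing the four required methods.
# The heuristics below are illustrative only: the harness does not
# prescribe how bits are allocated, just the method signatures.

class AdaptiveQuantPolicy:
    def __init__(self, num_layers: int = 24, max_bits: int = 4):
        self.num_layers = num_layers
        self.max_bits = max_bits  # shared upper budget cap

    def choose_bits(self, layer_id, kv_kind, head_group,
                    token_stats, budget_state) -> int:
        # Spend more bits on early layers, and one extra bit on keys,
        # which tend to be more outlier-prone than values.
        base = self.max_bits if layer_id < self.num_layers // 2 else self.max_bits - 1
        if kv_kind == "key":
            return min(base + 1, self.max_bits)
        return max(base, 2)

    def choose_axis(self, layer_id, kv_kind) -> str:
        # Per-channel for keys, per-token for values (a KIVI-style split).
        return "channel" if kv_kind == "key" else "token"

    def residual_length(self, layer_id, request_meta) -> int:
        # Keep the most recent tokens of every request in full precision.
        return 32

    def calibration_mode(self, workload_meta, budget_state) -> str:
        # A single fixed mode; an adaptive policy could switch on workload_meta.
        return "static"
```

A real policy would condition on `token_stats` and `budget_state` instead of the fixed thresholds shown here.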
The replay harness also supports these optional advanced hooks so that source-backed overlap baselines can be represented faithfully. If a policy does not define them, the harness uses fixed defaults:
- quantizer_family(layer_id, kv_kind, request_meta) -> str
- group_size(layer_id, kv_kind, request_meta) -> int
- sink_tokens(layer_id, kv_kind, request_meta) -> int
- clip_ratio(layer_id, kv_kind, request_meta) -> float
- quantization_level(layer_id, kv_kind, request_meta) -> str
- quantization_method(layer_id, kv_kind, request_meta) -> str
- symmetric(layer_id, kv_kind, request_meta) -> bool
- outliers_ratio(layer_id, kv_kind, request_meta) -> float
- use_attentions(layer_id, kv_kind, request_meta) -> bool
- bit_range(layer_id, kv_kind, request_meta, budget_state) -> tuple[int, int]
- last_n_attentions(layer_id, kv_kind, request_meta) -> int
- target_quantization_error(layer_id, kv_kind, request_meta) -> float
- q_norm(layer_id, kv_kind, request_meta) -> float
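A policy only needs to define the hooks it actually uses; anything left undefined falls back to the harness's fixed defaults. A sketch with three of the hooks, where the specific values (group size 32, four sink tokens, mild key clipping) are illustrative assumptions rather than the real KIVI/SKVQ defaults:

```python
# Sketch of a policy opting into a subset of the optional hooks.
# The numeric choices are illustrative, not harness or baseline defaults.

class OverlapStylePolicy:
    def group_size(self, layer_id, kv_kind, request_meta) -> int:
        # Grouped quantization over blocks of 32 channels, a common choice.
        return 32

    def sink_tokens(self, layer_id, kv_kind, request_meta) -> int:
        # Keep the first few "attention sink" tokens in full precision.
        return 4

    def clip_ratio(self, layer_id, kv_kind, request_meta) -> float:
        # Mild range clipping on keys only; values use the full range.
        return 0.98 if kv_kind == "key" else 1.0
```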
What You Cannot Modify
- The model family and deterministic decode replay loop
- The benchmark workload presets
- The parser or evaluation metric definitions
- The underlying Transformers model implementation
Evaluation
This task currently uses a tensor-level Transformers replay loop with:
- visible host model: Qwen/Qwen2.5-0.5B-Instruct
- visible backend: DynamicCache snapshots with task-local tensor quantizers
- external-validity-only follow-up: no packed-cache runtime claims inside the canonical leaderboard
The visible replay exposes frozen benchmark slices from three public benchmark families:
- longbench_slice: excerpted LongBench long-context QA / retrieval
- needle_slice: excerpted passkey retrieval
- reasoning_slice: excerpted benchmark-style STEM reasoning QA
Benchmark provenance for the frozen slices is recorded in benchmarks/README.md.
The visible test_cmds currently run:
- longbench-slice
- needle-slice
- reasoning-slice
The parser expects TEST_METRICS: lines with at least these fields:
- quality_main
- quality_delta_vs_ref
- peak_kv_memory_mb
- decode_tokens_per_s
- prefill_latency_ms
- calibration_cost_s
- avg_bits_per_kv_token
- cache_error
- greedy_match_ratio
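An emitter for the parser line might look like the sketch below. Only the `TEST_METRICS:` prefix and the field names are given by this document; the JSON serialization of the payload and the `emit_test_metrics` helper name are assumptions for illustration:

```python
import json

# Required field names, taken from the parser contract above.
REQUIRED_FIELDS = [
    "quality_main", "quality_delta_vs_ref", "peak_kv_memory_mb",
    "decode_tokens_per_s", "prefill_latency_ms", "calibration_cost_s",
    "avg_bits_per_kv_token", "cache_error", "greedy_match_ratio",
]

def emit_test_metrics(metrics: dict) -> str:
    """Serialize one metrics dict as a TEST_METRICS: line.

    The JSON payload format is an assumption; only the prefix and the
    required field names come from the task description.
    """
    missing = [k for k in REQUIRED_FIELDS if k not in metrics]
    if missing:
        raise ValueError(f"missing required metric fields: {missing}")
    return "TEST_METRICS: " + json.dumps(metrics, sort_keys=True)
```

Validating the required fields before printing catches a malformed run early instead of failing silently at parse time.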
Notes
- The harness runs a real deterministic replay over Transformers decode steps.
- At each replay step it snapshots the real KV tensors, quantizes them with the current policy, restores the quantized cache, and replays the next token prediction.
- quality_main is not a pure exact-match score. It combines:
  - real next-token replay under quantized caches
  - a cache-distortion proxy measured on the quantized KV tensors
- peak_kv_memory_mb is a KV-specific memory estimate, not raw device-wide CUDA peak.
- All visible benchmark commands run sequentially by default.
- Canonical visible baselines are now:
  - uniform 4-bit KV quantization
  - uniform 2-bit KV quantization
  - source-backed kivi_overlap_2bit
  - source-backed kivi_overlap_4bit
  - source-backed skvq_overlap_2bit
- The retained SOTA-family anchor under the current contract is kivi_overlap_4bit. The upstream KIVI repo reports that on LongBench:
  - for LongChat-7B-32K, KIVI-4 slightly exceeds the full-precision average (38.79 vs 38.72)
  - for Mistral-7B-Instruct-v0.2, KIVI-4 essentially matches full precision (43.53 vs 43.54)
- You may use the exposed Transformers source tree for reference, especially src/transformers/cache_utils.py.
- The raw bash scripts/*.sh commands assume mid_edit has already materialized transformers-kv-lab/custom_quant_eval.py in the workspace package root.
- The visible scripts currently impose a uniform upper budget cap of 4 bits per KV entry; lower-bit baselines are evaluated as stricter policies under that shared cap.
- kivi_overlap_2bit and kivi_overlap_4bit map to the official repo's public defaults, but they do not reproduce KIVI's packed-cache kernels.
- skvq_overlap_2bit reuses SKVQ's public grouped-quantization, sink-token, window, and clipping primitives, but not the reorder/pre-RoPE/custom-kernel paths.
- Other audited repos such as QAQ and KVQuant are intentionally left out of the canonical visible set because their runtime assumptions still exceed the current contract.
Code
```python
"""Tensor-level KV-cache quantization replay harness.

This scaffold replays deterministic decode steps on top of Hugging Face
Transformers. Instead of collapsing the policy into one global
QuantizedCacheConfig, it snapshots real KV tensors, quantizes them with
source-backed overlap rules, and replays the next decode step with the
quantized cache.
"""

from __future__ import annotations

import argparse
import json
import math
import os
```
Additional context files (read-only):
transformers-kv-lab/src/transformers/cache_utils.py
Results
No results available yet.