llm-kv-adaptive-quantization

ML Systems · transformers-kv-lab · rigorous codebase

Description

LLM KV Cache: Adaptive Quantization Policy

Research Question

Design an adaptive KV-cache quantization policy for decoder-only LLM inference on top of a tensor-level Transformers replay harness. The primary research focus is bit allocation and residual preservation under a fixed replay contract. Your policy should reduce KV-cache memory while preserving output quality across frozen benchmark slices.

What You Can Modify

The editable region is the AdaptiveQuantPolicy class in custom_quant_eval.py. The core policy problem is:

  • choosing bit-widths for keys and values
  • choosing the quantization axis
  • choosing the residual full-precision window
  • selecting a calibration mode for the current workload

The policy must keep these core methods:

  • choose_bits(layer_id, kv_kind, head_group, token_stats, budget_state) -> int
  • choose_axis(layer_id, kv_kind) -> str
  • residual_length(layer_id, request_meta) -> int
  • calibration_mode(workload_meta, budget_state) -> str
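A minimal sketch of the required policy surface. The internal heuristics here (cap-width keys, per-channel key axis, a fixed 32-token residual window, static calibration) are illustrative assumptions for shape only, not the harness defaults and not a tuned policy:

```python
class AdaptiveQuantPolicy:
    """Sketch of the four mandatory policy methods (illustrative values)."""

    def choose_bits(self, layer_id, kv_kind, head_group, token_stats, budget_state) -> int:
        # Keys tend to be more quantization-sensitive than values,
        # so keep them at the 4-bit cap and push values lower.
        return 4 if kv_kind == "key" else 2

    def choose_axis(self, layer_id, kv_kind) -> str:
        # Per-channel keys, per-token values (a KIVI-style split).
        return "channel" if kv_kind == "key" else "token"

    def residual_length(self, layer_id, request_meta) -> int:
        # Keep the most recent tokens in full precision.
        return 32

    def calibration_mode(self, workload_meta, budget_state) -> str:
        return "static"
```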

The replay harness also supports these optional advanced hooks so that source-backed overlap baselines can be represented faithfully. If a policy does not define them, the harness uses fixed defaults:

  • quantizer_family(layer_id, kv_kind, request_meta) -> str
  • group_size(layer_id, kv_kind, request_meta) -> int
  • sink_tokens(layer_id, kv_kind, request_meta) -> int
  • clip_ratio(layer_id, kv_kind, request_meta) -> float
  • quantization_level(layer_id, kv_kind, request_meta) -> str
  • quantization_method(layer_id, kv_kind, request_meta) -> str
  • symmetric(layer_id, kv_kind, request_meta) -> bool
  • outliers_ratio(layer_id, kv_kind, request_meta) -> float
  • use_attentions(layer_id, kv_kind, request_meta) -> bool
  • bit_range(layer_id, kv_kind, request_meta, budget_state) -> tuple[int, int]
  • last_n_attentions(layer_id, kv_kind, request_meta) -> int
  • target_quantization_error(layer_id, kv_kind, request_meta) -> float
  • q_norm(layer_id, kv_kind, request_meta) -> float
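Since the harness falls back to fixed defaults for any hook a policy leaves undefined, a policy only needs to override the hooks it cares about. A hypothetical sketch of a policy overriding a few of them; the family name and the specific values below are illustrative assumptions, not the harness defaults:

```python
class OverlapAwarePolicy:
    """Sketch of selectively overriding optional hooks (illustrative values)."""

    def quantizer_family(self, layer_id, kv_kind, request_meta) -> str:
        # Route this layer through a KIVI-style overlap quantizer.
        return "kivi_overlap"

    def group_size(self, layer_id, kv_kind, request_meta) -> int:
        # Grouped quantization: one scale/zero-point per 32 elements.
        return 32

    def sink_tokens(self, layer_id, kv_kind, request_meta) -> int:
        # Keep a handful of attention-sink tokens in full precision.
        return 4

    def clip_ratio(self, layer_id, kv_kind, request_meta) -> float:
        # Clip the quantization range to 95% of the observed extremum.
        return 0.95
```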

What You Cannot Modify

  • The model family and deterministic decode replay loop
  • The benchmark workload presets
  • The parser or evaluation metric definitions
  • The underlying Transformers model implementation

Evaluation

This task currently uses a tensor-level Transformers replay loop with:

  • visible host model: Qwen/Qwen2.5-0.5B-Instruct
  • visible backend: DynamicCache snapshots with task-local tensor quantizers
  • external-validity-only follow-up: no packed-cache runtime claims inside the canonical leaderboard

The visible replay exposes frozen benchmark slices from three public benchmark families:

  • longbench_slice: excerpted LongBench long-context QA / retrieval
  • needle_slice: excerpted passkey retrieval
  • reasoning_slice: excerpted benchmark-style STEM reasoning QA

Benchmark provenance for the frozen slices is recorded in benchmarks/README.md.

The visible test_cmds currently run:

  • longbench-slice
  • needle-slice
  • reasoning-slice

The parser expects TEST_METRICS: lines with at least these fields:

  • quality_main
  • quality_delta_vs_ref
  • peak_kv_memory_mb
  • decode_tokens_per_s
  • prefill_latency_ms
  • calibration_cost_s
  • avg_bits_per_kv_token
  • cache_error
  • greedy_match_ratio
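A sketch of emitting such a line, assuming JSON serialization after the `TEST_METRICS:` prefix; the exact serialization the parser accepts is an assumption here, and the zero values are placeholders, not real results:

```python
import json

# Placeholder values only; a real run fills these from the replay.
metrics = {
    "quality_main": 0.0,
    "quality_delta_vs_ref": 0.0,
    "peak_kv_memory_mb": 0.0,
    "decode_tokens_per_s": 0.0,
    "prefill_latency_ms": 0.0,
    "calibration_cost_s": 0.0,
    "avg_bits_per_kv_token": 0.0,
    "cache_error": 0.0,
    "greedy_match_ratio": 0.0,
}
print("TEST_METRICS: " + json.dumps(metrics))
```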

Notes

  • The harness runs a real deterministic replay over Transformers decode steps.
  • At each replay step it snapshots the real KV tensors, quantizes them with the current policy, restores the quantized cache, and replays the next token prediction.
  • quality_main is not a pure exact-match score. It combines:
    • real next-token replay under quantized caches
    • a cache-distortion proxy measured on the quantized KV tensors
  • peak_kv_memory_mb is a KV-specific memory estimate, not raw device-wide CUDA peak.
  • All visible benchmark commands run sequentially by default.
  • Canonical visible baselines are now:
    • uniform 4-bit KV quantization
    • uniform 2-bit KV quantization
    • source-backed kivi_overlap_2bit
    • source-backed kivi_overlap_4bit
    • source-backed skvq_overlap_2bit
  • The retained SOTA-family anchor under the current contract is kivi_overlap_4bit. The upstream KIVI repo reports that on LongBench:
    • for LongChat-7B-32K, KIVI-4 slightly exceeds the full-precision average (38.79 vs 38.72)
    • for Mistral-7B-Instruct-v0.2, KIVI-4 essentially matches full precision (43.53 vs 43.54)
  • You may use the exposed Transformers source tree for reference, especially src/transformers/cache_utils.py.
  • The raw bash scripts/*.sh commands assume mid_edit has already materialized transformers-kv-lab/custom_quant_eval.py in the workspace package root.
  • The visible scripts currently impose a uniform upper budget cap of 4 bits per KV entry; lower-bit baselines are evaluated as stricter policies under that shared cap.
  • kivi_overlap_2bit and kivi_overlap_4bit map to the official repo's public defaults, but they do not reproduce KIVI's packed-cache kernels.
  • skvq_overlap_2bit reuses SKVQ's public grouped-quantization, sink-token, window, and clipping primitives, but not reorder/pre-RoPE/custom-kernel paths.
  • Other audited repos such as QAQ and KVQuant are intentionally left out of the canonical visible set because their runtime assumptions still exceed the current contract.
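The snapshot-quantize-restore cycle in the notes above can be sketched with a toy symmetric uniform quantizer, using NumPy in place of the real torch KV tensors; the relative-L2 distortion computed at the end is only an illustration of a cache-distortion proxy, not the harness's actual `cache_error` definition:

```python
import numpy as np

def fake_quantize(kv: np.ndarray, bits: int, axis: int) -> np.ndarray:
    """Symmetric uniform quantize/dequantize along one axis."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(kv), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax)
    return q * scale  # restored (dequantized) cache

# Stand-in for one snapshotted KV tensor: (tokens, head_dim).
kv = np.random.default_rng(0).standard_normal((4, 128)).astype(np.float32)
restored = fake_quantize(kv, bits=4, axis=1)  # per-token quantization

# Relative L2 error as a toy cache-distortion proxy.
cache_error = np.linalg.norm(kv - restored) / np.linalg.norm(kv)
```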

Code

custom_quant_eval.py
"""Tensor-level KV-cache quantization replay harness.

This scaffold replays deterministic decode steps on top of Hugging Face
Transformers. Instead of collapsing the policy into one global
QuantizedCacheConfig, it snapshots real KV tensors, quantizes them with
source-backed overlap rules, and replays the next decode step with the
quantized cache.
"""

from __future__ import annotations

import argparse
import json
import math
import os

Additional context files (read-only):

  • transformers-kv-lab/src/transformers/cache_utils.py

Results

No results available yet.