llm-kv-adaptive-quantization

ML Systems · transformers-kv-lab · rigorous codebase

Description

LLM KV Cache: Adaptive Quantization Policy

Research Question

Design an adaptive KV-cache quantization policy for decoder-only LLM inference on top of a tensor-level Transformers replay harness. The primary research focus is bit allocation and residual preservation under a fixed replay contract. Your policy should reduce KV-cache memory while preserving output quality across frozen benchmark slices.

What You Can Modify

The editable region is the AdaptiveQuantPolicy class in custom_quant_eval.py. The core policy problem is:

  • choosing bit-widths for keys and values
  • choosing the quantization axis
  • choosing the residual full-precision window
  • selecting a calibration mode for the current workload

The policy must keep these core methods:

  • choose_bits(layer_id, kv_kind, head_group, token_stats, budget_state) -> int
  • choose_axis(layer_id, kv_kind) -> str
  • residual_length(layer_id, request_meta) -> int
  • calibration_mode(workload_meta, budget_state) -> str
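A minimal sketch of the required policy surface. The internal heuristics here (cap-width keys, per-channel key axis, a fixed 32-token residual window, static calibration) are illustrative assumptions for shape only, not the harness defaults and not a tuned policy:

```python
class AdaptiveQuantPolicy:
    """Sketch of the four mandatory policy methods (illustrative values)."""

    def choose_bits(self, layer_id, kv_kind, head_group, token_stats, budget_state) -> int:
        # Keys tend to be more quantization-sensitive than values,
        # so keep them at the 4-bit cap and push values lower.
        return 4 if kv_kind == "key" else 2

    def choose_axis(self, layer_id, kv_kind) -> str:
        # Per-channel keys, per-token values (a KIVI-style split).
        return "channel" if kv_kind == "key" else "token"

    def residual_length(self, layer_id, request_meta) -> int:
        # Keep the most recent tokens in full precision.
        return 32

    def calibration_mode(self, workload_meta, budget_state) -> str:
        return "static"
```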

The replay harness also supports these optional advanced hooks so that source-backed overlap baselines can be represented faithfully. If a policy does not define them, the harness uses fixed defaults:

  • quantizer_family(layer_id, kv_kind, request_meta) -> str
  • group_size(layer_id, kv_kind, request_meta) -> int
  • sink_tokens(layer_id, kv_kind, request_meta) -> int
  • clip_ratio(layer_id, kv_kind, request_meta) -> float
  • quantization_level(layer_id, kv_kind, request_meta) -> str
  • quantization_method(layer_id, kv_kind, request_meta) -> str
  • symmetric(layer_id, kv_kind, request_meta) -> bool
  • outliers_ratio(layer_id, kv_kind, request_meta) -> float
  • use_attentions(layer_id, kv_kind, request_meta) -> bool
  • bit_range(layer_id, kv_kind, request_meta, budget_state) -> tuple[int, int]
  • last_n_attentions(layer_id, kv_kind, request_meta) -> int
  • target_quantization_error(layer_id, kv_kind, request_meta) -> float
  • q_norm(layer_id, kv_kind, request_meta) -> float
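Since the harness falls back to fixed defaults for any hook a policy leaves undefined, a policy only needs to override the hooks it cares about. A hypothetical sketch of a policy overriding a few of them; the family name and the specific values below are illustrative assumptions, not the harness defaults:

```python
class OverlapAwarePolicy:
    """Sketch of selectively overriding optional hooks (illustrative values)."""

    def quantizer_family(self, layer_id, kv_kind, request_meta) -> str:
        # Route this layer through a KIVI-style overlap quantizer.
        return "kivi_overlap"

    def group_size(self, layer_id, kv_kind, request_meta) -> int:
        # Grouped quantization: one scale/zero-point per 32 elements.
        return 32

    def sink_tokens(self, layer_id, kv_kind, request_meta) -> int:
        # Keep a handful of attention-sink tokens in full precision.
        return 4

    def clip_ratio(self, layer_id, kv_kind, request_meta) -> float:
        # Clip the quantization range to 95% of the observed extremum.
        return 0.95
```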

What You Cannot Modify

  • The model family and deterministic decode replay loop
  • The benchmark workload presets
  • The parser or evaluation metric definitions
  • The underlying Transformers model implementation

Evaluation

This task currently uses a tensor-level Transformers replay loop with:

  • visible host model: Qwen/Qwen2.5-0.5B-Instruct
  • visible backend: DynamicCache snapshots with task-local tensor quantizers
  • external-validity-only follow-up: no packed-cache runtime claims inside the canonical leaderboard

The visible replay exposes frozen benchmark slices from three public benchmark families:

  • longbench_slice: excerpted LongBench long-context QA / retrieval
  • needle_slice: excerpted passkey retrieval
  • reasoning_slice: excerpted benchmark-style STEM reasoning QA

Benchmark provenance for the frozen slices is recorded in benchmarks/README.md.

The visible test_cmds currently run:

  • longbench-slice
  • needle-slice
  • reasoning-slice

The parser expects TEST_METRICS: lines with at least these fields:

  • quality_main
  • quality_delta_vs_ref
  • peak_kv_memory_mb
  • decode_tokens_per_s
  • prefill_latency_ms
  • calibration_cost_s
  • avg_bits_per_kv_token
  • cache_error
  • greedy_match_ratio
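A sketch of emitting such a line, assuming JSON serialization after the `TEST_METRICS:` prefix; the exact serialization the parser accepts is an assumption here, and the zero values are placeholders, not real results:

```python
import json

# Placeholder values only; a real run fills these from the replay.
metrics = {
    "quality_main": 0.0,
    "quality_delta_vs_ref": 0.0,
    "peak_kv_memory_mb": 0.0,
    "decode_tokens_per_s": 0.0,
    "prefill_latency_ms": 0.0,
    "calibration_cost_s": 0.0,
    "avg_bits_per_kv_token": 0.0,
    "cache_error": 0.0,
    "greedy_match_ratio": 0.0,
}
print("TEST_METRICS: " + json.dumps(metrics))
```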

Notes

  • The harness runs a real deterministic replay over Transformers decode steps.
  • At each replay step it snapshots the real KV tensors, quantizes them with the current policy, restores the quantized cache, and replays the next token prediction.
  • quality_main is not a pure exact-match score. It combines:
    • real next-token replay under quantized caches
    • a cache-distortion proxy measured on the quantized KV tensors
  • peak_kv_memory_mb is a KV-specific memory estimate, not raw device-wide CUDA peak.
  • All visible benchmark commands run sequentially by default.
  • Canonical visible baselines are now:
    • uniform 4-bit KV quantization
    • uniform 2-bit KV quantization
    • source-backed kivi_overlap_2bit
    • source-backed kivi_overlap_4bit
    • source-backed skvq_overlap_2bit
  • The retained SOTA-family anchor under the current contract is kivi_overlap_4bit. The upstream KIVI repo reports that on LongBench:
    • for LongChat-7B-32K, KIVI-4 slightly exceeds the full-precision average (38.79 vs 38.72)
    • for Mistral-7B-Instruct-v0.2, KIVI-4 essentially matches full precision (43.53 vs 43.54)
  • You may use the exposed Transformers source tree for reference, especially src/transformers/cache_utils.py.
  • The raw bash scripts/*.sh commands assume mid_edit has already materialized transformers-kv-lab/custom_quant_eval.py in the workspace package root.
  • The visible scripts currently impose a uniform upper budget cap of 4 bits per KV entry; lower-bit baselines are evaluated as stricter policies under that shared cap.
  • kivi_overlap_2bit and kivi_overlap_4bit map to the official repo's public defaults, but they do not reproduce KIVI's packed-cache kernels.
  • skvq_overlap_2bit reuses SKVQ's public grouped-quantization, sink-token, window, and clipping primitives, but not reorder/pre-RoPE/custom-kernel paths.
  • Other audited repos such as QAQ and KVQuant are intentionally left out of the canonical visible set because their runtime assumptions still exceed the current contract.
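The snapshot-quantize-restore cycle in the notes above can be sketched with a toy symmetric uniform quantizer, using NumPy in place of the real torch KV tensors; the relative-L2 distortion computed at the end is only an illustration of a cache-distortion proxy, not the harness's actual `cache_error` definition:

```python
import numpy as np

def fake_quantize(kv: np.ndarray, bits: int, axis: int) -> np.ndarray:
    """Symmetric uniform quantize/dequantize along one axis."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(kv), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax)
    return q * scale  # restored (dequantized) cache

# Stand-in for one snapshotted KV tensor: (tokens, head_dim).
kv = np.random.default_rng(0).standard_normal((4, 128)).astype(np.float32)
restored = fake_quantize(kv, bits=4, axis=1)  # per-token quantization

# Relative L2 error as a toy cache-distortion proxy.
cache_error = np.linalg.norm(kv - restored) / np.linalg.norm(kv)
```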

Code

custom_quant_eval.py
"""Tensor-level KV-cache quantization replay harness.

This scaffold replays deterministic decode steps on top of Hugging Face
Transformers. Instead of collapsing the policy into one global
QuantizedCacheConfig, it snapshots real KV tensors, quantizes them with
source-backed overlap rules, and replays the next decode step with the
quantized cache.
"""

from __future__ import annotations

import argparse
import json
import math
import os

Additional context files (read-only):

  • transformers-kv-lab/src/transformers/cache_utils.py

Results

No results available yet.