llm-kv-selection-budgeting

Tags: ML Systems · FastKV · rigorous codebase

Description

Design a workload-aware KV selection and eviction controller inside a shared FastKV benchmark harness.

Background

Transformer-based LLMs cache key-value (KV) tensors for every attention layer during autoregressive generation. At long context lengths this KV cache dominates GPU memory and forward-pass latency. A large family of methods — SnapKV, H2O, StreamingLLM, FastKV, and others — reduce the cache by selecting a subset of historical tokens per layer while keeping a small uncompressed "recent window". These methods differ in (a) how they score and pool historical tokens, (b) how large the recent window is, and (c) whether later layers receive a progressively amplified retention budget.
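The memory pressure is easy to see with back-of-envelope arithmetic. The sketch below assumes a grouped-query-attention model with illustrative dimensions (40 layers, 8 KV heads, head dim 128, fp16); the numbers are not taken from any specific checkpoint:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys and values are each [seq_len, num_kv_heads, head_dim] per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# 32k-token context at full retention vs 10% retention:
full = kv_cache_bytes(40, 8, 128, 32_000)      # 5_242_880_000 bytes ~ 5000 MiB
compressed = kv_cache_bytes(40, 8, 128, 3_200) # one tenth of that, ~500 MiB
```

At these dimensions a 10x retention reduction saves roughly 4.5 GiB per sequence, which is why proportional-retention methods dominate the long-context regime.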

This task evaluates KV selection-and-budgeting policies inside a shared FastKV runtime harness, measuring the tradeoff between workload quality, inference latency, and effective KV memory footprint.

Task

Modify the SelectionBudgetPolicy class in FastKV/custom_budget_eval.py (lines 573–595). Implement the four policy methods to control how the shared FastKV runtime allocates KV budget across layers and workloads. The harness translates your semantic policy into the closest official FastKV integration path, runs a reference fullkv pass and a compressed candidate pass, then reports quality, latency, and memory metrics.

Research Question

Given a workload slice and a shared FastKV runtime surface, can you design:

  • how much KV budget each layer receives
  • how large the recent uncompressed window should be
  • how historical tokens should be selected for retention
  • whether later layers should receive an amplified retention schedule

so that workload-native quality is preserved while latency and estimated KV memory improve relative to fullkv?

This task does not ask you to pick a named baseline family directly. It asks you to implement a single selection/eviction controller cut. The fixed harness then translates that semantic controller into the closest official FastKV integration path.

What You May Edit

You may edit only the SelectionBudgetPolicy class in FastKV/custom_budget_eval.py.

The editable methods are:

  • layer_budget_ratios(workload_meta, num_layers)
  • recent_window_tokens(workload_meta, num_layers)
  • history_selector(workload_meta)
  • late_layer_schedule(workload_meta, num_layers)
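A minimal sketch of the four methods is shown below. The method names come from the task contract; the return values, the `workload_meta` fields, and the `"pooled"` selector-mode string are illustrative assumptions, not the harness's actual interface:

```python
class SelectionBudgetPolicy:
    """Illustrative controller: uniform base budget plus late-layer amplification."""

    def layer_budget_ratios(self, workload_meta, num_layers):
        # Fraction of historical tokens each layer retains (1.0 == fullkv).
        return [0.10] * num_layers

    def recent_window_tokens(self, workload_meta, num_layers):
        # Size of the uncompressed recent window, in tokens.
        return 32

    def history_selector(self, workload_meta):
        # Selector mode for historical tokens ("pooled" is an assumed mode name).
        return "pooled"

    def late_layer_schedule(self, workload_meta, num_layers):
        # Optional per-layer multiplier; amplify retention in the last quarter.
        cutoff = (3 * num_layers) // 4
        return [1.0 if i < cutoff else 2.0 for i in range(num_layers)]
```

A workload-aware variant would branch on `workload_meta` (e.g. retain more for code workloads than for needle retrieval) rather than returning constants.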

The harness fixes the outer contract:

  • exact FastKV monkeypatch/runtime path
  • Hugging Face generate(..., use_cache=True) execution
  • a fullkv internal reference run
  • benchmark-family data loading and scoring
  • source-native workload scoring and fixed reporting

The controller cut is the largest interface common to all of the visible integrated FastKV baselines:

  • per-layer retention budget ratio
  • recent uncompressed window
  • historical token selector mode
  • optional late-layer amplification schedule

The harness infers the runtime family from those semantics and then applies the closest official FastKV integration path with post-init per-layer overrides. This keeps baselines and agent implementations on one editable cut.
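One plausible way the budget semantics compose is an elementwise product of the base retention ratios and the late-layer schedule, clamped at full retention. The composition rule below is an assumption for illustration; the real mapping is fixed inside the harness:

```python
def effective_layer_ratios(base_ratios, schedule, cap=1.0):
    # Multiply each layer's base retention ratio by its amplification factor,
    # clamping so no layer exceeds full-context retention.
    return [min(r * s, cap) for r, s in zip(base_ratios, schedule)]

# A 10% base budget with 2x amplification on the last two of eight layers:
ratios = effective_layer_ratios([0.10] * 8, [1.0] * 6 + [2.0] * 2)
```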

Evaluation

This is a runtime benchmark on top of the official FastKV codebase, not an offline replay proxy.

The canonical default model for this task is:

  • mistralai/Mistral-Nemo-Instruct-2407

The runtime still allows overrides via SELECTION_KV_MODEL or --model, but benchmark-facing comparisons should use one shared model across all baselines. The intended canonical anchor is now Mistral-Nemo rather than the earlier lightweight TinyLlama placeholder.
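The override resolution can be expressed as a small helper. The precedence order shown (--model beats SELECTION_KV_MODEL, which beats the canonical default) is an assumption about the harness, not a documented guarantee:

```python
import os

CANONICAL_MODEL = "mistralai/Mistral-Nemo-Instruct-2407"

def resolve_model(cli_model=None, env=None):
    # CLI flag wins, then the environment override, then the canonical anchor.
    env = os.environ if env is None else env
    return cli_model or env.get("SELECTION_KV_MODEL") or CANONICAL_MODEL
```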

By default, the task prefers the real benchmark assets under FastKV/data/ when they are present in the package worktree. The checked-in task-local samples remain only as a portability fallback for dry-runs or lightweight local validation.

The benchmark inputs come from the same public assets used by the current FastKV benchmark line:

  • LongBench
    • hotpotqa_e
    • passage_retrieval_en_e
    • repobench_p_e
  • Needle in a Haystack
    • PaulGrahamEssays

Visible workloads:

  • longbench_hotpotqa
  • longbench_passage_retrieval
  • longbench_repobench
  • needle_paulgraham

Auxiliary workload retained in code but not canonical:

  • gsm8k_reasoning
    • useful as a long-output stress test
    • currently too weakly discriminative to remain in the canonical leaderboard

The runtime executes:

  1. an unmodified fullkv reference generation pass
  2. a patched candidate generation pass using the runtime family inferred from the controller semantics
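The two passes feed the relative metrics. Consistent with the results table (e.g. fastkv's repobench delta of -1.000 against fullkv's 46.500), delta appears to be candidate minus reference and speedup appears to be reference latency over candidate latency; the dict shape below is illustrative:

```python
def compare_passes(ref, cand):
    # ref / cand: per-pass summaries with 'quality' and 'mean_forward_latency_s'
    # from the fullkv reference run and the patched candidate run.
    return {
        "quality_delta_vs_ref": cand["quality"] - ref["quality"],
        "speedup_vs_ref": ref["mean_forward_latency_s"] / cand["mean_forward_latency_s"],
    }
```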

The runtime emits a trace line showing whether each workload used:

  • data_source=package
  • or data_source=sample

Canonical remote evaluation should use data_source=package for all visible workloads.

Baselines

Visible canonical baselines are source-backed methods implemented on the shared selection/eviction cut and executed through the official FastKV integration path:

  • fullkv
  • snapkv
  • h2o
  • streamingllm
  • fastkv

Notes:

  • fullkv
    • direct
    • official uncompressed anchor from the FastKV repository
    • represented by a none selector and full budget ratios
  • fastkv
    • direct
    • official FastKV method with the public benchmark defaults
    • represented by pooled historical selection plus a late-layer amplification schedule
  • snapkv
    • partial
    • the runtime path is the official FastKV benchmark integration, while this task restores the original SnapKV default recipe (window_size=32, kernel_size=5, pooling=avgpool) and uses the shared proportional-retention mode
  • h2o
    • family-level
    • this task uses the official FastKV H2O integration path and the public benchmark defaults; it does not claim parity with the original H2O runtime
  • streamingllm
    • family-level
    • this task uses the official FastKV StreamingLLM integration path and the public benchmark defaults; it does not claim parity with the original StreamingLLM runtime
  • RocketKV and R-KV are intentionally excluded from the canonical visible set because their official open implementations live on different decode-time or two-stage runtimes and have not been ported faithfully into this shared harness.
  • This task should therefore be read as a FastKV-grounded retention-controller benchmark over a small set of methods that can share one Hugging Face runtime honestly, not a universal benchmark over every KV compression paper.

Metrics

The parser expects:

  • quality_main
  • quality_delta_vs_ref
  • mean_forward_latency_s
  • speedup_vs_ref
  • budget_utilization
  • peak_kv_memory_mb
  • constraint_violation_rate

budget_utilization reports the realized average KV retention ratio under the source-native runtime semantics:

  • 1.0 means full-context retention
  • values below 1.0 mean compression
  • source-faithful proportional-retention methods are therefore measured by the ratio they actually realize, not by a forced absolute token-cap reinterpretation
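Under that definition, realized utilization is just retained tokens over prompt tokens, averaged across layers. A sketch with illustrative variable names:

```python
def budget_utilization(retained_per_layer, prompt_tokens):
    # Average KV retention ratio actually realized across layers for one request.
    ratios = [kept / prompt_tokens for kept in retained_per_layer]
    return sum(ratios) / len(ratios)

# A proportional-retention method keeping 10% in every layer reports ~0.10,
# regardless of what an absolute token cap would have implied.
```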

constraint_violation_rate is reserved for true contract violations:

  • malformed override combinations
  • invalid recent-window settings
  • or real overflow in constant-cap mode
  • runtime OOM on a workload, which is treated as a hard failure for that workload and recorded as a zero-score result rather than aborting the whole baseline

Source-faithful proportional-retention baselines are not marked violating simply because their official memory schedule does not match an external absolute token cap. This task is therefore a source-native retention-controller benchmark, not a strict matched-absolute-cap budgeting benchmark.

When a workload hits OOM:

  • quality_main is forced to 0
  • speedup_vs_ref is forced to 0
  • constraint_violation_rate is forced to 1
  • the workload still emits a parsable row so the full baseline can finish and land in the leaderboard

All visible baselines are expected to emit all four visible workloads. If a baseline cannot survive a workload regime, that workload remains in the artifact and receives the OOM-to-zero failure record rather than being dropped from the row.
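The OOM-to-zero rule amounts to emitting a fully populated, parsable row with the forced fields. The field names come from the metrics list above; the overall row shape is an assumption:

```python
def oom_failure_row(workload, baseline):
    # Forced values per the OOM contract. Emitting a complete row lets the
    # baseline finish its remaining workloads and still land in the leaderboard.
    return {
        "workload": workload,
        "baseline": baseline,
        "quality_main": 0.0,
        "speedup_vs_ref": 0.0,
        "constraint_violation_rate": 1.0,
    }
```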

The harness still emits diagnostic budget-comparison evidence against the workload token cap:

  • diag_budget_stretch_vs_token_cap
  • diag_avg_effective_capacity_tokens
  • diag_avg_prompt_tokens

These diagnostics are useful for fairness analysis, but they are not primary leaderboard metrics.
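A plausible reading of these diagnostics, assuming the stretch value is the ratio of average realized per-layer capacity to the workload's nominal token cap (the formula is an assumption for illustration, not the harness's definition):

```python
def budget_diagnostics(capacity_tokens, prompt_tokens, token_cap):
    # capacity_tokens / prompt_tokens: per-request realized KV capacity and
    # prompt length; token_cap: the workload's nominal absolute token cap.
    avg_capacity = sum(capacity_tokens) / len(capacity_tokens)
    return {
        "diag_avg_effective_capacity_tokens": avg_capacity,
        "diag_avg_prompt_tokens": sum(prompt_tokens) / len(prompt_tokens),
        "diag_budget_stretch_vs_token_cap": avg_capacity / token_cap,
    }
```

A stretch value above 1.0 would indicate a method retaining more than the nominal cap, which is exactly the fairness signal these diagnostics exist to surface.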

Scoring is workload-native rather than a single shared heuristic:

  • hotpotqa uses the upstream LongBench QA F1 scorer
  • passage_retrieval uses the upstream retrieval scorer
  • repobench uses the upstream code similarity scorer
  • needle uses the upstream Rouge-1 retrieval score

quality_main keeps the source family scale:

  • LongBench stays on the same percent-style scale as upstream reporting
  • Needle stays on the upstream Rouge-style retrieval scale used by the checked-in Needle harness

fullkv is both the internal reference path for quality_delta_vs_ref and a visible anchor baseline.

Code

custom_budget_eval.py
```python
"""Runtime harness for llm-kv-selection-budgeting on exact FastKV methods."""

from __future__ import annotations

import argparse
import gc
import json
import os
import random
import re
import sys
import time
from pathlib import Path
from types import SimpleNamespace
```

Results

Per-workload results. Columns: quality_main, quality_delta_vs_ref, speedup_vs_ref, budget_utilization, peak_kv_memory_mb, constraint_violation_rate.

longbench-hotpotqa

| Model | Type | quality_main | quality_delta_vs_ref | speedup_vs_ref | budget_utilization | peak_kv_memory_mb | constraint_violation_rate |
|---|---|---|---|---|---|---|---|
| fastkv | baseline | 41.667 | 0.000 | 1.733 | 0.340 | 287.344 | 0.000 |
| fullkv | baseline | 41.667 | 0.000 | 1.741 | 1.000 | 845.508 | 0.000 |
| h2o | baseline | 41.667 | 0.000 | 0.663 | 0.100 | 84.375 | 0.000 |
| snapkv | baseline | 41.667 | 0.000 | 1.120 | 0.100 | 84.375 | 0.000 |
| streamingllm | baseline | 41.667 | 0.000 | 1.117 | 0.100 | 84.375 | 0.000 |

longbench-passage-retrieval

| Model | Type | quality_main | quality_delta_vs_ref | speedup_vs_ref | budget_utilization | peak_kv_memory_mb | constraint_violation_rate |
|---|---|---|---|---|---|---|---|
| fastkv | baseline | 100.000 | 0.000 | 1.736 | 0.340 | 320.078 | 0.000 |
| fullkv | baseline | 100.000 | 0.000 | 1.027 | 1.000 | 941.602 | 0.000 |
| h2o | baseline | 100.000 | 0.000 | 0.551 | 0.100 | 94.141 | 0.000 |
| snapkv | baseline | 100.000 | 0.000 | 1.107 | 0.100 | 94.141 | 0.000 |
| streamingllm | baseline | 100.000 | 0.000 | 1.038 | 0.100 | 94.141 | 0.000 |

longbench-repobench

| Model | Type | quality_main | quality_delta_vs_ref | speedup_vs_ref | budget_utilization | peak_kv_memory_mb | constraint_violation_rate |
|---|---|---|---|---|---|---|---|
| fastkv | baseline | 45.500 | -1.000 | 1.380 | 0.340 | 540.117 | 0.000 |
| fullkv | baseline | 46.500 | 0.000 | 1.134 | 1.000 | 1588.672 | 0.000 |
| h2o | baseline | 45.333 | -1.167 | 0.716 | 0.100 | 158.789 | 0.000 |
| snapkv | baseline | 46.500 | 0.000 | 1.061 | 0.100 | 158.789 | 0.000 |
| streamingllm | baseline | 45.500 | -1.000 | 1.358 | 0.100 | 158.789 | 0.000 |

needle-paulgraham

| Model | Type | quality_main | quality_delta_vs_ref | speedup_vs_ref | budget_utilization | peak_kv_memory_mb | constraint_violation_rate |
|---|---|---|---|---|---|---|---|
| fastkv | baseline | 8.182 | 0.000 | 1.297 | 0.340 | 220.938 | 0.000 |
| fullkv | baseline | 8.182 | 0.000 | 1.029 | 1.000 | 650.000 | 0.000 |
| h2o | baseline | 5.247 | -2.935 | 0.878 | 0.100 | 64.844 | 0.000 |
| snapkv | baseline | 7.597 | -0.585 | 1.078 | 0.100 | 64.844 | 0.000 |
| streamingllm | baseline | 2.576 | -5.606 | 1.388 | 0.100 | 64.844 | 0.000 |