llm-kv-selection-budgeting
Description
Design a workload-aware KV selection and eviction controller inside a shared FastKV benchmark harness.
Background
Transformer-based LLMs cache key-value (KV) tensors for every attention layer during autoregressive generation. At long context lengths this KV cache dominates GPU memory and forward-pass latency. A large family of methods — SnapKV, H2O, StreamingLLM, FastKV, and others — reduce the cache by selecting a subset of historical tokens per layer while keeping a small uncompressed "recent window". These methods differ in (a) how they score and pool historical tokens, (b) how large the recent window is, and (c) whether later layers receive a progressively amplified retention budget.
This task evaluates KV selection-and-budgeting policies inside a shared FastKV runtime harness, measuring the tradeoff between workload quality, inference latency, and effective KV memory footprint.
Task
Modify the SelectionBudgetPolicy class in FastKV/custom_budget_eval.py (lines 573–595). Implement the four policy methods to control how the shared FastKV runtime allocates KV budget across layers and workloads. The harness translates your semantic policy into the closest official FastKV integration path, runs a reference fullkv pass and a compressed candidate pass, then reports quality, latency, and memory metrics.
Research Question
Given a workload slice and a shared FastKV runtime surface, can you design:
- how much KV budget each layer receives
- how large the recent uncompressed window should be
- how historical tokens should be selected for retention
- whether later layers should receive an amplified retention schedule
so that workload-native quality is preserved while latency and estimated KV memory improve relative to fullkv?
This task does not ask you to pick a named baseline family directly. It asks you to implement a single selection/eviction controller cut. The fixed harness then translates that semantic controller into the closest official FastKV integration path.
What You May Edit
You may edit only the SelectionBudgetPolicy class in FastKV/custom_budget_eval.py.
The editable methods are:
- `layer_budget_ratios(workload_meta, num_layers)`
- `recent_window_tokens(workload_meta, num_layers)`
- `history_selector(workload_meta)`
- `late_layer_schedule(workload_meta, num_layers)`
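As a concrete starting point, a trivial controller satisfying this interface might look like the sketch below. The method names come from the task description, but the return conventions (a per-layer ratio list, a token count, a selector-mode string, and an optional schedule) are assumptions about the harness contract, not confirmed API:

```python
class SelectionBudgetPolicy:
    """Illustrative controller sketch; return types are assumed, not confirmed."""

    def layer_budget_ratios(self, workload_meta, num_layers):
        # Uniform retention: keep 30% of historical tokens at every layer.
        return [0.3] * num_layers

    def recent_window_tokens(self, workload_meta, num_layers):
        # Keep the most recent 32 tokens uncompressed.
        return 32

    def history_selector(self, workload_meta):
        # A pooled, SnapKV-style selection mode (label is hypothetical).
        return "pooled"

    def late_layer_schedule(self, workload_meta, num_layers):
        # No late-layer amplification in this sketch.
        return None
```

A real submission would branch on `workload_meta` (e.g. retrieval vs. code workloads) rather than returning constants.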
The harness fixes the outer contract:
- the exact `FastKV` monkeypatch/runtime path
- Hugging Face `generate(..., use_cache=True)` execution
- a `fullkv` internal reference run
- benchmark-family data loading and scoring
- source-native workload scoring and fixed reporting
The controller cut is the maximum common denominator of the visible integrated
FastKV baselines:
- per-layer retention budget ratio
- recent uncompressed window
- historical token selector mode
- optional late-layer amplification schedule
The harness infers the runtime family from those semantics and then applies the
closest official FastKV integration path with post-init per-layer overrides.
This keeps baselines and agent implementations on one editable cut.
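To make the optional late-layer amplification concrete, here is a hypothetical helper that ramps the retention ratio up over the later layers. The linear ramp shape and the parameter names (`amp`, `start_frac`) are illustrative choices, not the harness's actual schedule:

```python
def amplified_ratios(base_ratio, num_layers, amp=2.0, start_frac=0.5):
    """Per-layer retention ratios with a linear late-layer ramp (illustrative).

    Layers before start_frac keep base_ratio; later layers ramp linearly
    toward amp * base_ratio, clipped at full retention (1.0).
    """
    start = int(num_layers * start_frac)
    ratios = []
    for i in range(num_layers):
        if i < start:
            ratios.append(base_ratio)
        else:
            t = (i - start) / max(1, num_layers - 1 - start)
            ratios.append(min(1.0, base_ratio * (1.0 + (amp - 1.0) * t)))
    return ratios
```

Such a list could be returned from `layer_budget_ratios` when `late_layer_schedule` opts into amplification.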
Evaluation
This is a runtime benchmark on top of the official FastKV codebase, not an offline replay proxy.
The canonical default model for this task is:
mistralai/Mistral-Nemo-Instruct-2407
The runtime still allows overrides via `SELECTION_KV_MODEL` or `--model`, but benchmark-facing comparisons should use one shared model across all baselines; the intended canonical anchor is now Mistral-Nemo rather than the earlier lightweight TinyLlama placeholder.
By default, the task prefers the real benchmark assets under FastKV/data/ when they are present in the package worktree. The checked-in task-local samples remain only as a portability fallback for dry-runs or lightweight local validation.
The benchmark inputs come from the same public assets used by the current FastKV benchmark line:
- LongBench: `hotpotqa_e`, `passage_retrieval_en_e`, `repobench_p_e`
- Needle in a Haystack: `PaulGrahamEssays`
Visible workloads:
- `longbench_hotpotqa`
- `longbench_passage_retrieval`
- `longbench_repobench`
- `needle_paulgraham`
Auxiliary workload retained in code but not canonical:
`gsm8k_reasoning`:
- useful as a long-output stress test
- currently too weakly discriminative to remain in the canonical leaderboard
The runtime executes:
- an unmodified `fullkv` reference generation pass
- a patched candidate generation pass using the runtime family inferred from the controller semantics
The runtime emits a trace line showing whether each workload used:
- `data_source=package`, or
- `data_source=sample`
Canonical remote evaluation should use `data_source=package` for all visible workloads.
Baselines
Visible canonical baselines are source-backed methods implemented on the shared selection/eviction cut and executed through the official FastKV integration path:
`fullkv`, `snapkv`, `h2o`, `streamingllm`, `fastkv`
Notes:
- `fullkv` (direct): official uncompressed anchor from the `FastKV` repository; represented by a `none` selector and full budget ratios
- `fastkv` (direct): official `FastKV` method with the public benchmark defaults; represented by pooled historical selection plus a late-layer amplification schedule
- `snapkv` (partial): the runtime path is the official `FastKV` benchmark integration, while this task restores the original `SnapKV` default recipe (`window_size=32`, `kernel_size=5`, `pooling=avgpool`) and uses the shared proportional-retention mode
- `h2o` (family-level): this task uses the official `FastKV` H2O integration path and the public benchmark defaults; it does not claim parity with the original `H2O` runtime
- `streamingllm` (family-level): this task uses the official `FastKV` StreamingLLM integration path and the public benchmark defaults; it does not claim parity with the original `StreamingLLM` runtime
- `RocketKV` and `R-KV` are intentionally excluded from the canonical visible set because their official open implementations live on different decode-time or two-stage runtimes and have not been ported faithfully into this shared harness.
- This task should therefore be read as a `FastKV`-grounded retention-controller benchmark over a small set of methods that can share one Hugging Face runtime honestly, not a universal benchmark over every KV compression paper.
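The baseline notes above can be summarized as a mapping from each visible baseline to the shared controller cut. The dictionary below is a paraphrase for orientation only: the selector labels and key names are hypothetical, while the budget ratios echo the realized `budget_utilization` values reported in the results table (1.0, 0.34, and 0.10):

```python
# Hypothetical summary of how each visible baseline sits on the controller cut.
# Selector labels and key names are illustrative; ratios mirror the reported
# budget_utilization values (fullkv 1.0, fastkv 0.34, others 0.10).
BASELINE_CUTS = {
    "fullkv":       {"selector": "none",         "budget_ratio": 1.00, "late_layer_amp": False},
    "fastkv":       {"selector": "pooled",       "budget_ratio": 0.34, "late_layer_amp": True},
    "snapkv":       {"selector": "pooled",       "budget_ratio": 0.10, "late_layer_amp": False,
                     "window_size": 32, "kernel_size": 5, "pooling": "avgpool"},
    "h2o":          {"selector": "heavy_hitter", "budget_ratio": 0.10, "late_layer_amp": False},
    "streamingllm": {"selector": "positional",   "budget_ratio": 0.10, "late_layer_amp": False},
}
```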
Metrics
The parser expects:
`quality_main`, `quality_delta_vs_ref`, `mean_forward_latency_s`, `speedup_vs_ref`, `budget_utilization`, `peak_kv_memory_mb`, `constraint_violation_rate`
budget_utilization reports the realized average KV retention ratio under the source-native runtime semantics:
- `1.0` means full-context retention
- values below `1.0` mean compression
- source-faithful proportional-retention methods are therefore measured by the ratio they actually realize, not by a forced absolute token-cap reinterpretation
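A minimal sketch of how such a realized ratio could be computed, assuming the runtime can report how many tokens each layer actually retained (the function and argument names are made up for illustration):

```python
def budget_utilization(retained_per_layer, prompt_tokens):
    """Realized average KV retention ratio across layers (illustrative).

    retained_per_layer: tokens actually kept in each layer's KV cache.
    prompt_tokens: length of the uncompressed prompt context.
    Returns 1.0 for full-context retention, < 1.0 under compression.
    """
    per_layer_ratios = [r / prompt_tokens for r in retained_per_layer]
    return sum(per_layer_ratios) / len(per_layer_ratios)
```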
constraint_violation_rate is reserved for true contract violations:
- malformed override combinations
- invalid recent-window settings
- real overflow in constant-cap mode
- runtime OOM on a workload, which is treated as a hard failure for that workload and recorded as a zero-score result rather than aborting the whole baseline
Source-faithful proportional-retention baselines are not marked violating simply because their official memory schedule does not match an external absolute token cap. This task is therefore a source-native retention-controller benchmark, not a strict matched-absolute-cap budgeting benchmark.
When a workload hits OOM:
- `quality_main` is forced to `0`
- `speedup_vs_ref` is forced to `0`
- `constraint_violation_rate` is forced to `1`
- the workload still emits a parsable row so the full baseline can finish and land in the leaderboard
All visible baselines are expected to emit results for all four visible workloads. If a baseline cannot survive a workload regime, that workload remains in the artifact and receives the OOM-to-zero failure record rather than being dropped from the row.
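Under these rules, an OOM failure record might look like the following sketch. The helper name is hypothetical, but the field values follow the forced-to-zero contract stated above:

```python
def oom_failure_row(workload):
    """Parsable zero-score row for a workload that hit runtime OOM.

    Field names match the task's metric list; the helper itself is
    illustrative, not part of the harness API.
    """
    return {
        "workload": workload,
        "quality_main": 0.0,            # forced to 0 on OOM
        "speedup_vs_ref": 0.0,          # forced to 0 on OOM
        "constraint_violation_rate": 1.0,  # OOM counts as a contract violation
    }
```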
The harness still emits diagnostic budget-comparison evidence against the workload token cap:
- `diag_budget_stretch_vs_token_cap`
- `diag_avg_effective_capacity_tokens`
- `diag_avg_prompt_tokens`
These diagnostics are useful for fairness analysis, but they are not primary leaderboard metrics.
Scoring is workload-native rather than a single shared heuristic:
- `hotpotqa` uses the upstream LongBench QA F1 scorer
- `passage_retrieval` uses the upstream retrieval scorer
- `repobench` uses the upstream code similarity scorer
- `needle` uses the upstream Rouge-1 retrieval score
quality_main keeps the source family scale:
- LongBench stays on the same percent-style scale as upstream reporting
- Needle stays on the upstream Rouge-style retrieval scale used by the checked-in Needle harness
fullkv is both the internal reference path for quality_delta_vs_ref and a visible anchor baseline.
Code
```python
"""Runtime harness for llm-kv-selection-budgeting on exact FastKV methods."""

from __future__ import annotations

import argparse
import gc
import json
import os
import random
import re
import sys
import time
from pathlib import Path
from types import SimpleNamespace
```
Results
| Model | Type | quality main longbench-hotpotqa ↑ | quality delta vs ref longbench-hotpotqa ↑ | speedup vs ref longbench-hotpotqa ↑ | budget utilization longbench-hotpotqa ↑ | peak kv memory mb longbench-hotpotqa ↓ | constraint violation rate longbench-hotpotqa ↓ | quality main longbench-passage-retrieval ↑ | quality delta vs ref longbench-passage-retrieval ↑ | speedup vs ref longbench-passage-retrieval ↑ | budget utilization longbench-passage-retrieval ↑ | peak kv memory mb longbench-passage-retrieval ↓ | constraint violation rate longbench-passage-retrieval ↓ | quality main longbench-repobench ↑ | quality delta vs ref longbench-repobench ↑ | speedup vs ref longbench-repobench ↑ | budget utilization longbench-repobench ↑ | peak kv memory mb longbench-repobench ↓ | constraint violation rate longbench-repobench ↓ | quality main needle-paulgraham ↑ | quality delta vs ref needle-paulgraham ↑ | speedup vs ref needle-paulgraham ↑ | budget utilization needle-paulgraham ↑ | peak kv memory mb needle-paulgraham ↓ | constraint violation rate needle-paulgraham ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fastkv | baseline | 41.667 | 0.000 | 1.733 | 0.340 | 287.344 | 0.000 | 100.000 | 0.000 | 1.736 | 0.340 | 320.078 | 0.000 | 45.500 | -1.000 | 1.380 | 0.340 | 540.117 | 0.000 | 8.182 | 0.000 | 1.297 | 0.340 | 220.938 | 0.000 |
| fullkv | baseline | 41.667 | 0.000 | 1.741 | 1.000 | 845.508 | 0.000 | 100.000 | 0.000 | 1.027 | 1.000 | 941.602 | 0.000 | 46.500 | 0.000 | 1.134 | 1.000 | 1588.672 | 0.000 | 8.182 | 0.000 | 1.029 | 1.000 | 650.000 | 0.000 |
| h2o | baseline | 41.667 | 0.000 | 0.663 | 0.100 | 84.375 | 0.000 | 100.000 | 0.000 | 0.551 | 0.100 | 94.141 | 0.000 | 45.333 | -1.167 | 0.716 | 0.100 | 158.789 | 0.000 | 5.247 | -2.935 | 0.878 | 0.100 | 64.844 | 0.000 |
| snapkv | baseline | 41.667 | 0.000 | 1.120 | 0.100 | 84.375 | 0.000 | 100.000 | 0.000 | 1.107 | 0.100 | 94.141 | 0.000 | 46.500 | 0.000 | 1.061 | 0.100 | 158.789 | 0.000 | 7.597 | -0.585 | 1.078 | 0.100 | 64.844 | 0.000 |
| streamingllm | baseline | 41.667 | 0.000 | 1.117 | 0.100 | 84.375 | 0.000 | 100.000 | 0.000 | 1.038 | 0.100 | 94.141 | 0.000 | 45.500 | -1.000 | 1.358 | 0.100 | 158.789 | 0.000 | 2.576 | -5.606 | 1.388 | 0.100 | 64.844 | 0.000 |