llm-kv-selection-budgeting

Tags: ML Systems · FastKV · rigorous codebase

Description

Design a workload-aware KV selection and eviction controller inside a shared FastKV benchmark harness.

Background

Transformer-based LLMs cache key-value (KV) tensors for every attention layer during autoregressive generation. At long context lengths this KV cache dominates GPU memory and forward-pass latency. A large family of methods — SnapKV, H2O, StreamingLLM, FastKV, and others — reduce the cache by selecting a subset of historical tokens per layer while keeping a small uncompressed "recent window". These methods differ in (a) how they score and pool historical tokens, (b) how large the recent window is, and (c) whether later layers receive a progressively amplified retention budget.
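The memory pressure is easy to see with back-of-envelope arithmetic. The sketch below assumes a grouped-query-attention model with illustrative dimensions (40 layers, 8 KV heads, head dim 128, fp16); the numbers are not taken from any specific checkpoint:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys and values are each [seq_len, num_kv_heads, head_dim] per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# 32k-token context at full retention vs 10% retention:
full = kv_cache_bytes(40, 8, 128, 32_000)      # 5_242_880_000 bytes ~ 5000 MiB
compressed = kv_cache_bytes(40, 8, 128, 3_200) # one tenth of that, ~500 MiB
```

At these dimensions a 10x retention reduction saves roughly 4.5 GiB per sequence, which is why proportional-retention methods dominate the long-context regime.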

This task evaluates KV selection-and-budgeting policies inside a shared FastKV runtime harness, measuring the tradeoff between workload quality, inference latency, and effective KV memory footprint.

Task

Modify the SelectionBudgetPolicy class in FastKV/custom_budget_eval.py (lines 573–595). Implement the four policy methods to control how the shared FastKV runtime allocates KV budget across layers and workloads. The harness translates your semantic policy into the closest official FastKV integration path, runs a reference fullkv pass and a compressed candidate pass, then reports quality, latency, and memory metrics.

Research Question

Given a workload slice and a shared FastKV runtime surface, can you design:

  • how much KV budget each layer receives
  • how large the recent uncompressed window should be
  • how historical tokens should be selected for retention
  • whether later layers should receive an amplified retention schedule

so that workload-native quality is preserved while latency and estimated KV memory improve relative to fullkv?

This task does not ask you to pick a named baseline family directly. It asks you to implement a single selection/eviction controller cut. The fixed harness then translates that semantic controller into the closest official FastKV integration path.

What You May Edit

You may edit only the SelectionBudgetPolicy class in FastKV/custom_budget_eval.py.

The editable methods are:

  • layer_budget_ratios(workload_meta, num_layers)
  • recent_window_tokens(workload_meta, num_layers)
  • history_selector(workload_meta)
  • late_layer_schedule(workload_meta, num_layers)
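A minimal sketch of the four methods is shown below. The method names come from the task contract; the return values, the `workload_meta` fields, and the `"pooled"` selector-mode string are illustrative assumptions, not the harness's actual interface:

```python
class SelectionBudgetPolicy:
    """Illustrative controller: uniform base budget plus late-layer amplification."""

    def layer_budget_ratios(self, workload_meta, num_layers):
        # Fraction of historical tokens each layer retains (1.0 == fullkv).
        return [0.10] * num_layers

    def recent_window_tokens(self, workload_meta, num_layers):
        # Size of the uncompressed recent window, in tokens.
        return 32

    def history_selector(self, workload_meta):
        # Selector mode for historical tokens ("pooled" is an assumed mode name).
        return "pooled"

    def late_layer_schedule(self, workload_meta, num_layers):
        # Optional per-layer multiplier; amplify retention in the last quarter.
        cutoff = (3 * num_layers) // 4
        return [1.0 if i < cutoff else 2.0 for i in range(num_layers)]
```

A workload-aware variant would branch on `workload_meta` (e.g. retain more for code workloads than for needle retrieval) rather than returning constants.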

The harness fixes the outer contract:

  • exact FastKV monkeypatch/runtime path
  • Hugging Face generate(..., use_cache=True) execution
  • a fullkv internal reference run
  • benchmark-family data loading and scoring
  • source-native workload scoring and fixed reporting

The controller cut is the largest interface common to all of the visible integrated FastKV baselines:

  • per-layer retention budget ratio
  • recent uncompressed window
  • historical token selector mode
  • optional late-layer amplification schedule

The harness infers the runtime family from those semantics and then applies the closest official FastKV integration path with post-init per-layer overrides. This keeps baselines and agent implementations on one editable cut.
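One plausible way the budget semantics compose is an elementwise product of the base retention ratios and the late-layer schedule, clamped at full retention. The composition rule below is an assumption for illustration; the real mapping is fixed inside the harness:

```python
def effective_layer_ratios(base_ratios, schedule, cap=1.0):
    # Multiply each layer's base retention ratio by its amplification factor,
    # clamping so no layer exceeds full-context retention.
    return [min(r * s, cap) for r, s in zip(base_ratios, schedule)]

# A 10% base budget with 2x amplification on the last two of eight layers:
ratios = effective_layer_ratios([0.10] * 8, [1.0] * 6 + [2.0] * 2)
```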

Evaluation

This is a runtime benchmark on top of the official FastKV codebase, not an offline replay proxy.

The canonical default model for this task is:

  • mistralai/Mistral-Nemo-Instruct-2407

The runtime still allows overrides via SELECTION_KV_MODEL or --model, but benchmark-facing comparisons should use one shared model across all baselines. The intended canonical anchor is now Mistral-Nemo rather than the earlier lightweight TinyLlama placeholder.
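The override resolution can be expressed as a small helper. The precedence order shown (--model beats SELECTION_KV_MODEL, which beats the canonical default) is an assumption about the harness, not a documented guarantee:

```python
import os

CANONICAL_MODEL = "mistralai/Mistral-Nemo-Instruct-2407"

def resolve_model(cli_model=None, env=None):
    # CLI flag wins, then the environment override, then the canonical anchor.
    env = os.environ if env is None else env
    return cli_model or env.get("SELECTION_KV_MODEL") or CANONICAL_MODEL
```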

By default, the task prefers the real benchmark assets under FastKV/data/ when they are present in the package worktree. The checked-in task-local samples remain only as a portability fallback for dry-runs or lightweight local validation.

The benchmark inputs come from the same public assets used by the current FastKV benchmark line:

  • LongBench
    • hotpotqa_e
    • passage_retrieval_en_e
    • repobench_p_e
  • Needle in a Haystack
    • PaulGrahamEssays

Visible workloads:

  • longbench_hotpotqa
  • longbench_passage_retrieval
  • longbench_repobench
  • needle_paulgraham

Auxiliary workload retained in code but not canonical:

  • gsm8k_reasoning
    • useful as a long-output stress test
    • currently too weakly discriminative to remain in the canonical leaderboard

The runtime executes:

  1. an unmodified fullkv reference generation pass
  2. a patched candidate generation pass using the runtime family inferred from the controller semantics
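The two passes feed the relative metrics. Consistent with the results table (e.g. fastkv's repobench delta of -1.000 against fullkv's 46.500), delta appears to be candidate minus reference and speedup appears to be reference latency over candidate latency; the dict shape below is illustrative:

```python
def compare_passes(ref, cand):
    # ref / cand: per-pass summaries with 'quality' and 'mean_forward_latency_s'
    # from the fullkv reference run and the patched candidate run.
    return {
        "quality_delta_vs_ref": cand["quality"] - ref["quality"],
        "speedup_vs_ref": ref["mean_forward_latency_s"] / cand["mean_forward_latency_s"],
    }
```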

The runtime emits a trace line showing whether each workload used:

  • data_source=package
  • or data_source=sample

Canonical remote evaluation should use data_source=package for all visible workloads.

Baselines

Visible canonical baselines are source-backed methods implemented on the shared selection/eviction cut and executed through the official FastKV integration path:

  • fullkv
  • snapkv
  • h2o
  • streamingllm
  • fastkv

Notes:

  • fullkv
    • direct
    • official uncompressed anchor from the FastKV repository
    • represented by a none selector and full budget ratios
  • fastkv
    • direct
    • official FastKV method with the public benchmark defaults
    • represented by pooled historical selection plus a late-layer amplification schedule
  • snapkv
    • partial
    • the runtime path is the official FastKV benchmark integration, while this task restores the original SnapKV default recipe (window_size=32, kernel_size=5, pooling=avgpool) and uses the shared proportional-retention mode
  • h2o
    • family-level
    • this task uses the official FastKV H2O integration path and the public benchmark defaults; it does not claim parity with the original H2O runtime
  • streamingllm
    • family-level
    • this task uses the official FastKV StreamingLLM integration path and the public benchmark defaults; it does not claim parity with the original StreamingLLM runtime
  • RocketKV and R-KV are intentionally excluded from the canonical visible set because their official open implementations live on different decode-time or two-stage runtimes and have not been ported faithfully into this shared harness.
  • This task should therefore be read as a FastKV-grounded retention-controller benchmark over a small set of methods that can share one Hugging Face runtime honestly, not a universal benchmark over every KV compression paper.

Metrics

The parser expects:

  • quality_main
  • quality_delta_vs_ref
  • mean_forward_latency_s
  • speedup_vs_ref
  • budget_utilization
  • peak_kv_memory_mb
  • constraint_violation_rate

budget_utilization reports the realized average KV retention ratio under the source-native runtime semantics:

  • 1.0 means full-context retention
  • values below 1.0 mean compression
  • source-faithful proportional-retention methods are therefore measured by the ratio they actually realize, not by a forced absolute token-cap reinterpretation
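Under that definition, realized utilization is just retained tokens over prompt tokens, averaged across layers. A sketch with illustrative variable names:

```python
def budget_utilization(retained_per_layer, prompt_tokens):
    # Average KV retention ratio actually realized across layers for one request.
    ratios = [kept / prompt_tokens for kept in retained_per_layer]
    return sum(ratios) / len(ratios)

# A proportional-retention method keeping 10% in every layer reports ~0.10,
# regardless of what an absolute token cap would have implied.
```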

constraint_violation_rate is reserved for true contract violations:

  • malformed override combinations
  • invalid recent-window settings
  • or real overflow in constant-cap mode
  • runtime OOM on a workload, which is treated as a hard failure for that workload and recorded as a zero-score result rather than aborting the whole baseline

Source-faithful proportional-retention baselines are not marked violating simply because their official memory schedule does not match an external absolute token cap. This task is therefore a source-native retention-controller benchmark, not a strict matched-absolute-cap budgeting benchmark.

When a workload hits OOM:

  • quality_main is forced to 0
  • speedup_vs_ref is forced to 0
  • constraint_violation_rate is forced to 1
  • the workload still emits a parsable row so the full baseline can finish and land in the leaderboard

All visible baselines are expected to emit all four visible workloads. If a baseline cannot survive a workload regime, that workload remains in the artifact and receives the OOM-to-zero failure record rather than being dropped from the row.
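The OOM-to-zero rule amounts to emitting a fully populated, parsable row with the forced fields. The field names come from the metrics list above; the overall row shape is an assumption:

```python
def oom_failure_row(workload, baseline):
    # Forced values per the OOM contract. Emitting a complete row lets the
    # baseline finish its remaining workloads and still land in the leaderboard.
    return {
        "workload": workload,
        "baseline": baseline,
        "quality_main": 0.0,
        "speedup_vs_ref": 0.0,
        "constraint_violation_rate": 1.0,
    }
```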

The harness still emits diagnostic budget-comparison evidence against the workload token cap:

  • diag_budget_stretch_vs_token_cap
  • diag_avg_effective_capacity_tokens
  • diag_avg_prompt_tokens

These diagnostics are useful for fairness analysis, but they are not primary leaderboard metrics.
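A plausible reading of these diagnostics, assuming the stretch value is the ratio of average realized per-layer capacity to the workload's nominal token cap (the formula is an assumption for illustration, not the harness's definition):

```python
def budget_diagnostics(capacity_tokens, prompt_tokens, token_cap):
    # capacity_tokens / prompt_tokens: per-request realized KV capacity and
    # prompt length; token_cap: the workload's nominal absolute token cap.
    avg_capacity = sum(capacity_tokens) / len(capacity_tokens)
    return {
        "diag_avg_effective_capacity_tokens": avg_capacity,
        "diag_avg_prompt_tokens": sum(prompt_tokens) / len(prompt_tokens),
        "diag_budget_stretch_vs_token_cap": avg_capacity / token_cap,
    }
```

A stretch value above 1.0 would indicate a method retaining more than the nominal cap, which is exactly the fairness signal these diagnostics exist to surface.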

Scoring is workload-native rather than a single shared heuristic:

  • hotpotqa uses the upstream LongBench QA F1 scorer
  • passage_retrieval uses the upstream retrieval scorer
  • repobench uses the upstream code similarity scorer
  • needle uses the upstream Rouge-1 retrieval score

quality_main keeps the source family scale:

  • LongBench stays on the same percent-style scale as upstream reporting
  • Needle stays on the upstream Rouge-style retrieval scale used by the checked-in Needle harness

fullkv is both the internal reference path for quality_delta_vs_ref and a visible anchor baseline.

Code

custom_budget_eval.py
```python
"""Runtime harness for llm-kv-selection-budgeting on exact FastKV methods."""

from __future__ import annotations

import argparse
import gc
import json
import os
import random
import re
import sys
import time
from pathlib import Path
from types import SimpleNamespace
```

Results

Per-workload results. Columns: quality_main, quality_delta_vs_ref, speedup_vs_ref, budget_utilization, peak_kv_memory_mb, constraint_violation_rate.

longbench-hotpotqa

| Model | Type | quality_main | quality_delta_vs_ref | speedup_vs_ref | budget_utilization | peak_kv_memory_mb | constraint_violation_rate |
|---|---|---|---|---|---|---|---|
| fastkv | baseline | 41.667 | 0.000 | 1.733 | 0.340 | 287.344 | 0.000 |
| fullkv | baseline | 41.667 | 0.000 | 1.741 | 1.000 | 845.508 | 0.000 |
| h2o | baseline | 41.667 | 0.000 | 0.663 | 0.100 | 84.375 | 0.000 |
| snapkv | baseline | 41.667 | 0.000 | 1.120 | 0.100 | 84.375 | 0.000 |
| streamingllm | baseline | 41.667 | 0.000 | 1.117 | 0.100 | 84.375 | 0.000 |

longbench-passage-retrieval

| Model | Type | quality_main | quality_delta_vs_ref | speedup_vs_ref | budget_utilization | peak_kv_memory_mb | constraint_violation_rate |
|---|---|---|---|---|---|---|---|
| fastkv | baseline | 100.000 | 0.000 | 1.736 | 0.340 | 320.078 | 0.000 |
| fullkv | baseline | 100.000 | 0.000 | 1.027 | 1.000 | 941.602 | 0.000 |
| h2o | baseline | 100.000 | 0.000 | 0.551 | 0.100 | 94.141 | 0.000 |
| snapkv | baseline | 100.000 | 0.000 | 1.107 | 0.100 | 94.141 | 0.000 |
| streamingllm | baseline | 100.000 | 0.000 | 1.038 | 0.100 | 94.141 | 0.000 |

longbench-repobench

| Model | Type | quality_main | quality_delta_vs_ref | speedup_vs_ref | budget_utilization | peak_kv_memory_mb | constraint_violation_rate |
|---|---|---|---|---|---|---|---|
| fastkv | baseline | 45.500 | -1.000 | 1.380 | 0.340 | 540.117 | 0.000 |
| fullkv | baseline | 46.500 | 0.000 | 1.134 | 1.000 | 1588.672 | 0.000 |
| h2o | baseline | 45.333 | -1.167 | 0.716 | 0.100 | 158.789 | 0.000 |
| snapkv | baseline | 46.500 | 0.000 | 1.061 | 0.100 | 158.789 | 0.000 |
| streamingllm | baseline | 45.500 | -1.000 | 1.358 | 0.100 | 158.789 | 0.000 |

needle-paulgraham

| Model | Type | quality_main | quality_delta_vs_ref | speedup_vs_ref | budget_utilization | peak_kv_memory_mb | constraint_violation_rate |
|---|---|---|---|---|---|---|---|
| fastkv | baseline | 8.182 | 0.000 | 1.297 | 0.340 | 220.938 | 0.000 |
| fullkv | baseline | 8.182 | 0.000 | 1.029 | 1.000 | 650.000 | 0.000 |
| h2o | baseline | 5.247 | -2.935 | 0.878 | 0.100 | 64.844 | 0.000 |
| snapkv | baseline | 7.597 | -0.585 | 1.078 | 0.100 | 64.844 | 0.000 |
| streamingllm | baseline | 2.576 | -5.606 | 1.388 | 0.100 | 64.844 | 0.000 |