ar-video-kv-temporal-policy

Autonomous Robotics · FAR · rigorous codebase

Description


Research Question

Design a better temporal frame-retention policy for autoregressive video generation. Given a fixed host model (FAR — Flow Autoregressive Reconstruction, a transformer-based AR video model) and a hard frame-count budget, can you decide which historical frames to keep, demote, or drop at each generation step so that long-horizon video quality (PSNR/SSIM/LPIPS vs ground truth) stays high while the retained-history size stays low?

This is a real-rollout benchmark: the policy directly controls which latent frames the FAR transformer attends to during video prediction on UCF-101 and DMLab. Metrics are measured on real decoded pixel frames against ground-truth video clips — not proxies.

This task isolates one scientific question: temporal history management for AR video generation. You may not change the video tokenizer, sampler, architecture, dataset, or prompt set.

Harness Design

The evaluator runs FAR (far_model.py + autoencoder_dc_model.py) directly in a policy-managed generation loop:

  1. Encode the first n_context real video frames to latent space.
  2. For each prediction step (1 to n_predict):
     a. Run FAR's ODE solver to generate the next latent frame, attending to all frames in kept_latents.
     b. Append the generated latent to kept_latents.
     c. Call policy.build_temporal_cache_plan(chunk_meta_list, rollout_state, budget_state).
     d. Apply the plan: prune kept_latents to only the frames marked keep or demote_to_long_term (respecting budget_state['max_frames']).
  3. Decode all generated frames and measure PSNR / SSIM / LPIPS vs. ground-truth frames.

The policy therefore controls which real video latents the model attends to at every generation step. Evicted frames are truly removed from the attention context — they are not re-encoded in subsequent steps.
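In code, the loop reduces to the following sketch. `rollout`, the stand-in model step, and the policy signature are hypothetical simplifications (the real harness calls FAR's ODE solver and the full plan machinery), but the pruning mechanics — append, plan, prune, clamp — mirror the harness:

```python
import numpy as np

def rollout(context_latents, n_predict, policy, max_frames):
    """Sketch of the policy-managed generation loop (hypothetical API names)."""
    kept_latents = list(context_latents)          # step 1: encoded context frames
    for step in range(n_predict):
        # step 2a: stand-in for FAR's ODE solver attending to kept_latents
        new_latent = np.mean(kept_latents, axis=0)
        kept_latents.append(new_latent)           # step 2b
        # step 2c: policy returns one "keep"/"drop" decision per retained frame
        decisions = policy(kept_latents, step, max_frames)
        # step 2d: prune -- evicted frames are truly gone, never re-encoded
        kept_latents = [f for f, d in zip(kept_latents, decisions) if d == "keep"]
        kept_latents = kept_latents[-max_frames:] # hard frame-count budget
    return kept_latents

# naive recency policy: keep everything and let the budget clamp do the eviction
recency = lambda frames, step, max_frames: ["keep"] * len(frames)
kept = rollout([np.zeros((4, 8, 8)) for _ in range(5)], n_predict=11,
               policy=recency, max_frames=4)
```

Note that the budget clamp drops from the oldest end, so the retained history under the naive policy is always the most recent `max_frames` latents.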

Key model facts:

  • FAR-B (130M params, far_model.py): used for UCF-101 short prediction.
  • FAR-B-Long (150M params, far_long_model.py): used for DMLab long prediction.
  • DCAE (8× spatial compression, 32 latent channels, 64px): encodes/decodes pixel frames.
  • Inference: 20 ODE steps (FlowMatchEulerDiscreteScheduler), unconditional generation.

What You Can Modify

You may edit only the VideoTemporalKVPolicy class in custom_video_eval.py.

The single editable method is:

  • build_temporal_cache_plan(chunk_meta_list, rollout_state, budget_state) -> TemporalCachePlan

chunk_meta_list contains one FrameMeta per frame currently in kept_latents:

| Field | Description |
|---|---|
| chunk_id | 0-based index in the current history (0 = oldest) |
| age | frames since creation (0 = most recent) |
| dynamic_age | same as age |
| recentness | 1 / (1 + age) |
| size_mb | latent memory footprint (float32) |
| boundary_score | normalised latent L2 change from the previous frame, in [0, 1] |
| feature_drift | cumulative L2 drift from the first context frame, in [0, 1] |
| temporal_similarity | cosine similarity to the previous frame, in [0, 1] |
| motion_strength | magnitude of the frame-to-frame latent change, in [0, 1] |
| keyframe_score | same as motion_strength (local-peak proxy) |
| long_term_resident | always False in this harness |
| compression_state | always "none" |
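For intuition, the latent-space features could be computed roughly as below. This numpy sketch — `frame_features` and its normalisations — is an illustrative assumption, not the harness's actual feature code:

```python
import numpy as np

def frame_features(latents):
    """Per-frame temporal features from a list of latent arrays (sketch only)."""
    feats = []
    first = latents[0].ravel()
    # frame-to-frame L2 deltas; the max is used as an illustrative normaliser
    deltas = [np.linalg.norm((b - a).ravel()) for a, b in zip(latents, latents[1:])]
    max_delta = max(deltas) if deltas else 1.0
    for i, lat in enumerate(latents):
        prev = latents[i - 1] if i > 0 else lat
        v, p = lat.ravel(), prev.ravel()
        cos = float(v @ p / (np.linalg.norm(v) * np.linalg.norm(p) + 1e-8))
        delta = float(np.linalg.norm(v - p))
        age = len(latents) - 1 - i
        feats.append({
            "age": age,
            "recentness": 1.0 / (1.0 + age),
            "boundary_score": delta / (max_delta + 1e-8),   # L2 change vs prev, scaled to [0, 1]
            "feature_drift": float(np.linalg.norm(v - first)),  # raw here; the harness normalises
            "temporal_similarity": max(0.0, cos),           # cosine vs previous frame, clipped
            "motion_strength": delta / (max_delta + 1e-8),
        })
    return feats
```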

rollout_state:

  • step: current prediction step (0 = generating first new frame)
  • budget_capacity_mb: latent buffer capacity for this budget regime
  • total_steps: total number of prediction steps

budget_state:

  • max_frames: hard upper bound on number of frames to retain
  • capacity_mb: same as budget_capacity_mb

TemporalCachePlan fields the policy may set:

| Field | Type | Description |
|---|---|---|
| chunk_decisions | List[ChunkDecision] | per-frame action ("keep" / "drop" / "demote_to_long_term") plus a priority |
| retention_family | str | descriptive tag: "recency" / "anchor" / "queue" / "chunkwise" |
| anchor_preservation_rule | str | informational only in this harness |
| boundary_transition_rule | str | informational only in this harness |
| queue_depth | int | informational only in this harness |
| long_term_budget_fraction | float | informational only in this harness |
| chunkwise_reuse | bool | informational only |
| compression_mode | str | informational only |
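A minimal policy body might look like the following anchor-plus-recency sketch. The `ChunkDecision` and `TemporalCachePlan` stand-ins and the dict-style FrameMeta access are assumptions for a self-contained example; the real types live in custom_video_eval.py and may differ in detail:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkDecision:          # stand-in for the harness type
    chunk_id: int
    action: str               # "keep" / "drop" / "demote_to_long_term"
    priority: float = 0.0

@dataclass
class TemporalCachePlan:      # stand-in for the harness type
    chunk_decisions: list = field(default_factory=list)
    retention_family: str = "recency"

def build_temporal_cache_plan(chunk_meta_list, rollout_state, budget_state):
    """Keep the first context frame as a temporal anchor, then fill the
    remaining budget with the most recent frames."""
    max_frames = budget_state["max_frames"]
    plan = TemporalCachePlan(retention_family="anchor")
    keep_ids = {chunk_meta_list[0]["chunk_id"]}            # temporal anchor
    by_recentness = sorted(chunk_meta_list,
                           key=lambda m: m["recentness"], reverse=True)
    for meta in by_recentness:
        if len(keep_ids) >= max_frames:
            break
        keep_ids.add(meta["chunk_id"])
    for meta in chunk_meta_list:
        action = "keep" if meta["chunk_id"] in keep_ids else "drop"
        plan.chunk_decisions.append(
            ChunkDecision(meta["chunk_id"], action, meta["recentness"]))
    return plan
```

Reserving one slot for the oldest frame is the simplest version of the anchor idea the packcache baseline builds on; the rest of the budget degrades gracefully to pure recency.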

The evaluator's _apply_plan enforces two rules on top of chunk_decisions:

  1. The most recent frame is always kept (even if marked drop).
  2. If the number of kept frames > budget_state['max_frames'], the oldest frames are dropped until the budget is met.
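These two rules, together with the demote-counts-as-keep convention noted later, can be sketched as follows (a hypothetical `apply_plan`, not the evaluator's actual `_apply_plan`):

```python
def apply_plan(kept_latents, decisions, max_frames):
    """Enforce the evaluator's rules on a list of per-frame action strings:
    'demote_to_long_term' counts as keep; the newest frame always survives;
    oldest frames are dropped first when the budget is exceeded."""
    kept = [i for i, d in enumerate(decisions)
            if d in ("keep", "demote_to_long_term")]
    newest = len(decisions) - 1
    if newest not in kept:            # rule 1: most recent frame is always kept
        kept.append(newest)
    while len(kept) > max_frames:     # rule 2: drop oldest beyond the budget
        kept.pop(0)
    return [kept_latents[i] for i in kept]
```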

What You Cannot Modify

  • the host FAR model (transformer + DCAE)
  • the evaluation datasets (UCF-101 test set, DMLab)
  • the generation protocol (n_context, n_predict, n_inference_steps, seed)
  • the metric definitions (PSNR / SSIM / LPIPS)

Controlled Evaluation

Visible workloads:

| Workload | Dataset | Context | Predict | Model | Budget |
|---|---|---|---|---|---|
| ucf101_short_prediction | UCF-101 (50 clips) | 5 frames | 11 frames | FAR-B | medium_history_budget |
| ucf101_short_prediction | UCF-101 (50 clips) | 5 frames | 11 frames | FAR-B | tight_history_budget |
| dmlab_long_prediction | DMLab (20 clips) | 36 frames | 36 frames | FAR-B-Long | medium_history_budget |

Visible budget regimes:

| Budget | Max Frames | Description |
|---|---|---|
| full_history_budget | unlimited | keep all frames (upper-bound quality reference) |
| medium_history_budget | 8 frames | moderate memory pressure |
| tight_history_budget | 4 frames | aggressive eviction |

Metrics

The parser expects TEST_METRICS: lines with:

| Metric | Description |
|---|---|
| temporal_quality_main | primary quality score (0–100); maps PSNR linearly, capped so 40 dB → 100 |
| psnr | Peak Signal-to-Noise Ratio (dB) of predicted frames vs ground truth |
| ssim | Structural Similarity Index vs ground truth |
| lpips | LPIPS perceptual distance (AlexNet backbone; lower is better) |
| peak_kv_memory_mb | retained latent buffer size (MB) at max_frames |
| fps | predicted frames per second |
| n_frames_kept | effective retained history length at the maximum budget |
| eval_mode | always real_rollout |
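As an illustration, the temporal_quality_main mapping and a plausible emitter for such a line might look like this. The exact key=value layout the parser expects is an assumption; only the linear PSNR cap at 40 dB is stated above:

```python
def temporal_quality_main(psnr_db):
    """PSNR mapped linearly to 0-100, capped so that 40 dB gives 100."""
    return min(max(psnr_db, 0.0), 40.0) / 40.0 * 100.0

def emit_test_metrics(psnr, ssim, lpips, peak_kv_memory_mb, fps, n_frames_kept):
    # Hypothetical line format -- the real parser's layout may differ.
    return ("TEST_METRICS: "
            f"temporal_quality_main={temporal_quality_main(psnr):.2f} "
            f"psnr={psnr:.2f} ssim={ssim:.4f} lpips={lpips:.4f} "
            f"peak_kv_memory_mb={peak_kv_memory_mb:.2f} fps={fps:.2f} "
            f"n_frames_kept={n_frames_kept} eval_mode=real_rollout")
```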

Canonical Baselines

| Baseline | Family | Description |
|---|---|---|
| recent_window | recency | naive most-recent policy: keep the highest-recentness frames |
| packcache | anchor | paper-backed (arXiv:2601.04359): temporal anchor + keyframe check precedes drift-based eviction |
| causalcache_vdm | queue | paper-backed (arXiv:2411.16375): shared pool + queue eviction + long-term resident promotion |
| flowcache | chunkwise | SOTA, paper-backed (arXiv:2602.10825): chunkwise adaptive caching with boundary detection |

Baseline arXiv Sources

| Baseline | Canonical Paper | arXiv |
|---|---|---|
| packcache | PackCache: Training-Free Acceleration for Unified Autoregressive Video Generation via Compact KV-Cache | 2601.04359 |
| flowcache | Flow Caching for Autoregressive Video Generation | 2602.10825 |
| causalcache_vdm | Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing | 2411.16375 |

FAR Model Weights

Required weights (placed at FAR_WEIGHTS_DIR=/data/far_weights/):

| File | Model | Task |
|---|---|---|
| short_video_prediction/FAR_B_UCF101_Uncond64-381d295f.pth | FAR-B transformer | UCF-101 prediction |
| long_video_prediction/FAR_B_Long_DMLab_Action64-c09441dc.pth | FAR-B-Long transformer | DMLab prediction |
| dcae/DCAE_UCF101_Res64-9da18dcf.pth | DCAE VAE | UCF-101 encode/decode |
| dcae/DCAE_DMLab_Res64-17035ae5.pth | DCAE VAE | DMLab encode/decode |

HuggingFace repository: guyuchao/FAR_Models

Notes

  • FAR is the fixed host model / runtime substrate for this domain. It does not appear as a baseline.
  • Features in FrameMeta (boundary_score, temporal_similarity, etc.) are computed from the latent tensors, not from decoded pixel space. They are lightweight and aligned with the model's internal representation.
  • demote_to_long_term is treated identically to keep in _apply_plan; policies may use it as a semantic tag.
  • The policy is evaluated under a strict frame-count budget. Memory efficiency matters — policies that waste budget on uninformative frames will produce lower-quality predictions.
  • For long-horizon DMLab prediction (36 context + 36 predict), frame retention decisions compound: wrong choices early propagate to many subsequent steps.
  • Reference quality at full_history_budget on UCF-101: PSNR ≈ 23.1 dB, SSIM ≈ 0.78, LPIPS ≈ 0.056 (50 clips, 20 ODE steps).

Code

custom_video_eval.py
```python
"""Real FAR rollout harness for ar-video-kv-temporal-policy.

Evaluates VideoTemporalKVPolicy by running actual FAR (Flow Autoregressive
Reconstruction) inference on UCF101/DMLab clips and measuring PSNR/SSIM/LPIPS
vs ground truth frames.

Policy hook design
------------------
Rather than using FAR's built-in KV cache (which only avoids recomputation),
the harness prunes the `latents` tensor itself: at each generation step, the
policy decides which historical frames the model is allowed to attend to. Only
those frames are kept in `latents` and passed to the model. This directly
measures the quality impact of temporal history management.

Execution contexts
```
Results

No results yet.