Value-Based Visual Control

Studies how value-based RL losses, update rules, and exploration strategies affect visual-control episodic return.

Reinforcement Learning · cleanrl
rl-value-atari

Description

Online RL: Value-Based Methods for Visual Control (Atari)

Research Question

Design and implement a value-based RL algorithm for visual / Atari environments using CNN feature extraction. Your code goes in custom_value_atari.py. Several reference implementations are provided as read-only *.edit.py baselines.

Background

Atari games require learning from raw pixel observations (84x84 grayscale, 4 stacked frames). Value-based methods must learn an effective visual representation alongside Q-value estimation, handle high-dimensional observations, deal with sparse / delayed rewards, and use experience replay efficiently. Different design points address these via double targets, dueling decomposition, distributional value functions, or quantile critics.
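Of the design points listed above, the dueling decomposition is the simplest to write down. The sketch below is a minimal NumPy illustration, not code from custom_value_atari.py; the function name `dueling_q_values` and the toy arrays are assumptions made for the example. A real head would take both streams from the shared CNN features and use torch tensors.

```python
import numpy as np


def dueling_q_values(value, advantage):
    """Combine a state-value stream and an advantage stream into Q-values.

    value:     array of shape (batch, 1)
    advantage: array of shape (batch, n_actions)
    """
    # Subtracting the per-state mean advantage makes the split identifiable:
    # the mean of Q over actions then equals the value stream exactly.
    return value + advantage - advantage.mean(axis=1, keepdims=True)


# Toy inputs (hypothetical numbers, two states and three actions).
value = np.array([[1.0], [0.5]])
advantage = np.array([[0.2, -0.2, 0.0], [1.0, 2.0, 3.0]])
q = dueling_q_values(value, advantage)
```

Centering the advantages leaves the greedy action unchanged, so the decomposition only changes how the network distributes what it learns between the two streams.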

Reference baselines spanning the design space:

  • QR-DQN — Dabney et al., "Distributional Reinforcement Learning with Quantile Regression" (arXiv:1710.10044, AAAI 2018). Quantile-regression distributional critic with default 200 quantiles trained with the Huber quantile loss.
  • C51 — Bellemare, Dabney and Munos, "A Distributional Perspective on Reinforcement Learning" (arXiv:1707.06887, ICML 2017). Categorical distributional value function with default 51 atoms over [-10, 10].
  • Double DQN — van Hasselt, Guez and Silver, "Deep Reinforcement Learning with Double Q-learning" (arXiv:1509.06461, AAAI 2016). Decouples action selection from action evaluation in the TD target.
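The difference between a plain DQN target and the Double DQN target described in the last bullet can be sketched in a few lines. This is an illustrative NumPy version under assumed names (`dqn_target`, `double_dqn_target`, and the toy Q-value arrays are hypothetical); it is not taken from the reference baselines, which operate on torch tensors.

```python
import numpy as np


def dqn_target(reward, done, gamma, q_next_target):
    # Plain DQN: the target network both selects and evaluates the next action.
    return reward + gamma * (1.0 - done) * q_next_target.max(axis=1)


def double_dqn_target(reward, done, gamma, q_next_online, q_next_target):
    # Double DQN: the online network selects the next action ...
    next_actions = q_next_online.argmax(axis=1)
    # ... and the target network evaluates that selected action.
    rows = np.arange(len(next_actions))
    evaluated = q_next_target[rows, next_actions]
    return reward + gamma * (1.0 - done) * evaluated


# Toy batch of two transitions; the second one is terminal.
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q_next_online = np.array([[1.0, 2.0], [0.5, 0.1]])
q_next_target = np.array([[3.0, 1.5], [0.2, 0.9]])

y_dqn = dqn_target(reward, done, 0.99, q_next_target)
y_ddqn = double_dqn_target(reward, done, 0.99, q_next_online, q_next_target)
```

In the first transition the two networks disagree about the best next action, so the Double DQN target is lower than the plain DQN target. That is the overestimation-reducing effect the decoupling is meant to produce.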

Constraints

  • Network architecture dimensions are FIXED and cannot be modified.
  • Total parameter count is enforced at runtime; the contribution must be algorithmic (head design, target construction, TD loss, exploration, replay usage) rather than capacity.
  • Do not simply copy a reference implementation with minor changes.

Evaluation

Trained and evaluated on multiple Atari games including Breakout, Pong and BeamRider within a fixed interaction budget using the benchmark Atari wrappers. Metric: mean episodic return over evaluation episodes (higher is better). Strong methods should improve across games rather than tuning to a single title.

Code

custom_value_atari.py
# Custom value-based RL algorithm for Atari -- MLS-Bench
#
# EDITABLE section: QNetwork head and ValueAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, encoder, eval, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Method Summary

Auto-summarized from each method's code by an LLM reviewer, not the model's original output.
Claude Opus 4.6 · Formula · high

QR-DQN + Non-crossing penalty

QR-DQN with standard target-net argmax targets and a soft penalty on quantile crossing across all actions.

$$\mathcal{L} = \frac{1}{N}\sum_{i,j} \rho_\kappa^{\hat\tau_i}\!\bigl(y_j - \theta_i(s,a)\bigr) + \lambda_c \,\mathbb{E}\bigl[\,\mathrm{ReLU}(\theta_{i}(s,a') - \theta_{i+1}(s,a'))^2\,\bigr]$$
$$y_j = r + \gamma\,\theta_j\bigl(s', \arg\max_{a'} \bar{Q}_{\mathrm{tgt}}(s', a')\bigr)$$
Δ vs. baseline: On top of the QR-DQN baseline, adds a quadratic ReLU penalty on adjacent-quantile inversions (across all actions, not just the chosen one). Despite the agent's narrative, the code still computes both the next-state argmax and the quantile evaluation from the target network (`next_quant = self.target_network.get_quantiles(...)` then `next_actions = next_q.argmax(dim=1)`), i.e. plain QR-DQN target selection, not Double-Q.
  • n_quantiles = 200
  • κ (Huber) = 1.0
  • cross_coef λ_c = 0.01
  • target update = hard, every target_network_frequency

Recovers QR-DQN baseline at λ_c = 0 (drops the non-crossing term).
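The summarized loss can be illustrated with a small NumPy sketch. It is not the agent's code. The function names `quantile_huber_loss` and `crossing_penalty` are hypothetical, the quantile midpoints τ̂_i = (2i − 1) / (2N) follow the standard QR-DQN convention, and the expectation in the penalty is assumed to be a plain mean over batch, actions, and adjacent quantile pairs. The actual implementation works on torch tensors so that gradients flow through the predicted quantiles.

```python
import numpy as np


def quantile_huber_loss(theta, target, kappa=1.0):
    """Quantile Huber loss, (1/N) * sum over i, j, averaged over the batch.

    theta:  (batch, N) predicted quantiles theta_i(s, a) for the taken action
    target: (batch, N) target quantile samples y_j
    """
    n = theta.shape[1]
    tau_hat = (np.arange(n) + 0.5) / n              # quantile midpoints, shape (N,)
    # Pairwise TD errors: u[b, i, j] = y_j - theta_i
    u = target[:, None, :] - theta[:, :, None]
    abs_u = np.abs(u)
    huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    # Asymmetric weight |tau_hat_i - 1{u < 0}| turns Huber into a quantile loss.
    weight = np.abs(tau_hat[None, :, None] - (u < 0).astype(float))
    rho = weight * huber / kappa
    return rho.sum(axis=(1, 2)).mean() / n


def crossing_penalty(all_quantiles):
    """Squared ReLU of adjacent-quantile inversions over every action.

    all_quantiles: (batch, n_actions, N)
    """
    gaps = all_quantiles[..., :-1] - all_quantiles[..., 1:]   # theta_i - theta_{i+1}
    return (np.maximum(gaps, 0.0) ** 2).mean()


# Toy example: one transition, two actions, N = 2 quantiles.
theta_taken = np.array([[0.0, 1.0]])
y = np.array([[0.0, 1.0]])
all_q = np.array([[[0.0, 1.0], [2.0, 1.0]]])      # second action has crossed quantiles

lambda_c = 0.01                                    # cross_coef from the summary
qh = quantile_huber_loss(theta_taken, y)
pen = crossing_penalty(all_q)
total = qh + lambda_c * pen
```

With `lambda_c = 0` the total reduces to the quantile Huber term alone, matching the note that the method recovers the QR-DQN baseline when the non-crossing term is dropped.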

Results