rl-value-atari
Description
Online RL: Value-Based Methods for Visual Control (Atari)
Objective
Design and implement a value-based RL algorithm for visual/Atari environments using CNN feature extraction. Your code goes in custom_value_atari.py. Three reference implementations (DQN, DoubleDQN, C51) are provided as read-only.
Background
Atari games require learning from raw pixel observations (84x84 grayscale, 4 stacked frames). Value-based methods must learn effective visual representations alongside Q-value estimation. Key challenges include high-dimensional observations, sparse rewards, and memory-efficient experience replay. Different approaches address these through distributional value functions, frame stacking, or architecture innovations.
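The 84x84 grayscale, 4-frame-stacked observation format described above can be sketched as a small preprocessing pipeline. This is a dependency-free illustration, not the benchmark's fixed preprocessing code: the nearest-neighbour resize and the luminance approximation are simplifications (real pipelines typically use `cv2.resize` with area interpolation), and the `preprocess`/`FrameStack` names are made up for this example.

```python
import numpy as np
from collections import deque

def preprocess(rgb_frame: np.ndarray) -> np.ndarray:
    """Convert an RGB Atari frame (210x160x3) to an 84x84 grayscale image.

    Uses a channel mean as a cheap luminance approximation and a
    nearest-neighbour resize so the sketch needs only NumPy.
    """
    gray = rgb_frame.mean(axis=2).astype(np.uint8)
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)]

class FrameStack:
    """Maintain the last k preprocessed frames as a (k, 84, 84) observation,
    giving the agent short-term motion information from static frames."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        # On reset, fill the stack by repeating the first frame.
        processed = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return np.stack(self.frames)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess(frame))
        return np.stack(self.frames)
```

Stacking frames is what lets a feed-forward Q-network infer velocities (e.g. the ball's direction in Breakout) that a single frame cannot convey.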
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
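The runtime parameter-count check mentioned above can be sanity-checked offline before submitting. The sketch below counts parameters from weight-tensor shapes; the shapes shown are those of the classic Nature-DQN architecture (three conv layers plus a 512-unit head) with 18 actions, used here only as an illustration. The actual enforced budget and fixed dimensions are defined by the benchmark harness, not by this snippet.

```python
import math

def param_count(shapes) -> int:
    """Total number of parameters given an iterable of tensor shapes
    (weights and biases alike)."""
    return sum(math.prod(s) for s in shapes)

# Illustrative Nature-DQN shapes for a (4, 84, 84) input and 18 actions:
# conv1 8x8/4 -> 32 ch, conv2 4x4/2 -> 64 ch, conv3 3x3/1 -> 64 ch,
# flatten to 3136, fc 512, linear Q-head.
nature_dqn_shapes = [
    (32, 4, 8, 8), (32,),
    (64, 32, 4, 4), (64,),
    (64, 64, 3, 3), (64,),
    (512, 3136), (512,),
    (18, 512), (18,),
]
```

With PyTorch, the equivalent check is `sum(p.numel() for p in model.parameters())`; running it locally avoids discovering a budget violation only at evaluation time.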
Evaluation
Agents are trained and evaluated on Breakout, Pong, and BeamRider. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
Code
```python
# Custom value-based RL algorithm for Atari -- MLS-Bench
#
# EDITABLE section: QNetwork head and ValueAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, encoder, eval, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
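As one concrete example of the kind of update rule the editable `ValueAlgorithm` class would implement, here is the Double DQN target computation, which the read-only DoubleDQN reference presumably uses in some form. This NumPy sketch shows only the target math (the real implementation would operate on torch tensors inside the training loop); the function name and argument layout are assumptions for illustration.

```python
import numpy as np

def double_dqn_targets(q_online_next: np.ndarray,
                       q_target_next: np.ndarray,
                       rewards: np.ndarray,
                       dones: np.ndarray,
                       gamma: float = 0.99) -> np.ndarray:
    """Double DQN bootstrap targets for a batch of transitions.

    Action selection uses the online network's Q-values at s',
    while evaluation uses the target network's Q-values; decoupling
    the two reduces the overestimation bias of vanilla DQN.
    Arrays q_*_next have shape (batch, n_actions).
    """
    greedy = q_online_next.argmax(axis=1)                       # select with online net
    next_vals = q_target_next[np.arange(len(greedy)), greedy]   # evaluate with target net
    return rewards + gamma * (1.0 - dones) * next_vals          # zero bootstrap at terminals
```

The TD loss is then the (Huber) error between `Q_online(s, a)` and these targets, with gradients flowing only through the online network.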
Results
| Model | Type | Eval return Breakout-v4 ↑ | Eval return Seaquest-v4 ↑ | Eval return Pong-v4 ↑ |
|---|---|---|---|---|
| c51 | baseline | 264.633 | 969.333 | 20.267 |
| double_dqn | baseline | 170.667 | 6789.333 | 20.667 |
| qr_dqn | baseline | 252.433 | 9027.333 | 20.933 |
| anthropic/claude-opus-4.6 | vanilla | 89.067 | 5776.333 | 19.567 |
| deepseek-reasoner | vanilla | - | 1801.000 | 17.100 |
| google/gemini-3.1-pro-preview | vanilla | - | 7612.000 | 20.800 |
| gpt-5.4-pro | vanilla | 4.900 | 926.000 | 10.500 |
| qwen3.6-plus | vanilla | 0.267 | 96.667 | - |
| anthropic/claude-opus-4.6 | agent | 254.800 | 5778.667 | 20.700 |
| anthropic/claude-opus-4.6 | agent | 214.100 | 5778.667 | 20.700 |
| deepseek-reasoner | agent | 144.600 | 3230.000 | 14.967 |
| google/gemini-3.1-pro-preview | agent | 13.533 | 628.000 | -1.400 |
| gpt-5.4-pro | agent | 4.167 | 698.000 | -15.367 |
| gpt-5.4-pro | agent | 6.333 | 698.000 | -15.367 |
| qwen3.6-plus | agent | - | - | - |
| qwen3.6-plus | agent | - | - | - |
| qwen3.6-plus | agent | 2.300 | 834.000 | -20.500 |