rl-value-atari

Reinforcement Learning · cleanrl · rigorous codebase

Description

Online RL: Value-Based Methods for Visual Control (Atari)

Objective

Design and implement a value-based RL algorithm for visual/Atari environments using CNN feature extraction. Your code goes in custom_value_atari.py. Three reference implementations (DQN, DoubleDQN, C51) are provided as read-only.

Background

Atari games require learning from raw pixel observations (84x84 grayscale, 4 stacked frames). Value-based methods must learn effective visual representations alongside Q-value estimation. Key challenges include high-dimensional observations, sparse rewards, and memory-efficient experience replay. Different approaches address these through distributional value functions, frame stacking, or architecture innovations.
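The 84x84 grayscale / 4-frame-stack observation pipeline mentioned above can be sketched as follows. This is a simplified illustration (nearest-neighbor downsampling, no frame skip or max-pooling over consecutive frames); the actual preprocessing lives in the fixed env setup, and the names `to_gray_84` and `FrameStack` are ours.

```python
import numpy as np
from collections import deque

def to_gray_84(frame_rgb):
    """Downsample a (210, 160, 3) RGB Atari frame to 84x84 grayscale."""
    gray = frame_rgb.mean(axis=-1).astype(np.uint8)   # (210, 160)
    h, w = gray.shape
    rows = np.arange(84) * h // 84                     # nearest-neighbor indices
    cols = np.arange(84) * w // 84
    return gray[rows][:, cols]                         # (84, 84)

class FrameStack:
    """Keep the k most recent frames as one (k, 84, 84) observation."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # Fill the stack with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames)
```

Stacking frames gives the otherwise memoryless Q-network access to short-term motion information (e.g. ball velocity in Breakout), and storing frames as `uint8` keeps the replay buffer memory-efficient.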

Constraints

  • Network architecture dimensions are FIXED and cannot be modified
  • Total parameter count is enforced at runtime
  • Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
  • Do NOT simply copy a reference implementation with minor changes

Evaluation

Agents are trained and evaluated on Breakout, Pong, and BeamRider. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).

Code

custom_value_atari.py
```python
# Custom value-based RL algorithm for Atari -- MLS-Bench
#
# EDITABLE section: QNetwork head and ValueAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, encoder, eval, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
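As a sense of what the editable section contains, here is a hedged sketch of a minimal Q-value head and an epsilon-greedy action rule. The 512-dimensional feature size and 4-action head mirror the common cleanrl Atari setup but are assumptions here; the real fixed dimensions are enforced by the read-only parts of `custom_value_atari.py`, and `QHead` / `epsilon_greedy` are illustrative names.

```python
import random
import torch
import torch.nn as nn

class QHead(nn.Module):
    """Illustrative Q-value head on top of the fixed CNN encoder's features."""
    def __init__(self, feature_dim=512, num_actions=4):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_actions)

    def forward(self, features):
        # features: (batch, feature_dim) -> (batch, num_actions)
        return self.fc(features)

def epsilon_greedy(q_values, epsilon):
    """Take a uniformly random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(q_values.shape[-1])
    return int(q_values.argmax(dim=-1).item())
```

Because the architecture dimensions are fixed, the intended room for innovation is in what sits around such a head: the loss (scalar TD, distributional, quantile), the target-update rule, and the exploration schedule.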

Results

| Model | Type | Eval return (Breakout v4) | Eval return (Seaquest v4) | Eval return (Pong v4) |
| --- | --- | --- | --- | --- |
| c51 | baseline | 264.633 | 969.333 | 20.267 |
| double_dqn | baseline | 170.667 | 6789.333 | 20.667 |
| qr_dqn | baseline | 252.433 | 9027.333 | 20.933 |
| anthropic/claude-opus-4.6 | vanilla | 89.067 | 5776.333 | 19.567 |
| deepseek-reasoner | vanilla | - | 1801.000 | 17.100 |
| google/gemini-3.1-pro-preview | vanilla | - | 7612.000 | 20.800 |
| gpt-5.4-pro | vanilla | 4.900 | 926.000 | 10.500 |
| qwen3.6-plus | vanilla | 0.267 | 96.667 | - |
| anthropic/claude-opus-4.6 | agent | 254.800 | 5778.667 | 20.700 |
| anthropic/claude-opus-4.6 | agent | 214.100 | 5778.667 | 20.700 |
| deepseek-reasoner | agent | 144.600 | 3230.000 | 14.967 |
| google/gemini-3.1-pro-preview | agent | 13.533 | 628.000 | -1.400 |
| gpt-5.4-pro | agent | 4.167 | 698.000 | -15.367 |
| gpt-5.4-pro | agent | 6.333 | 698.000 | -15.367 |
| qwen3.6-plus | agent | - | - | - |
| qwen3.6-plus | agent | - | - | - |
| qwen3.6-plus | agent | 2.300 | 834.000 | -20.500 |

Agent Conversations