rl-value-discrete
Description
Online RL: Value-Based Methods for Discrete Control
Objective
Design and implement a value-based RL algorithm for discrete action spaces. Your code goes in `custom_value_discrete.py`. Three reference implementations (DQN, DoubleDQN, C51) are provided as read-only.
Background
Value-based methods estimate Q-values Q(s,a) for each state-action pair and derive a policy by selecting actions with the highest Q-value. Key challenges include overestimation bias, sample efficiency, and representing uncertainty. Different approaches address these through double Q-learning, distributional value functions, or prioritized replay.
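For concreteness, the overestimation issue comes from bootstrapping off a max over noisy Q-estimates; double Q-learning decouples action *selection* (online network) from action *evaluation* (target network). A minimal sketch of the two bootstrap targets (tensor names and shapes are illustrative, not the harness's interface):

```python
import torch

def dqn_target(rewards, dones, next_q_target, gamma=0.99):
    # Standard DQN: bootstrap from the max of the target network's Q-values
    # (max over noisy estimates is biased upward).
    max_next_q = next_q_target.max(dim=1).values
    return rewards + gamma * (1.0 - dones) * max_next_q

def double_dqn_target(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online network picks the action, the target network scores it.
    best_actions = next_q_online.argmax(dim=1, keepdim=True)
    eval_q = next_q_target.gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * eval_q
```

When the online network overrates an action that the target network does not, the double-DQN target uses the lower, decoupled estimate, which is the source of its reduced bias.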
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
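The task statement does not show the runtime parameter check; a plausible sketch of how such a budget could be enforced (the limit value here is hypothetical, the real one is set by the harness):

```python
import torch.nn as nn

PARAM_LIMIT = 100_000  # hypothetical budget; the actual enforced limit comes from the harness

def count_parameters(model: nn.Module) -> int:
    # Total trainable parameters in the network.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def assert_within_budget(model: nn.Module, limit: int = PARAM_LIMIT) -> None:
    # Raise if the model exceeds the parameter budget.
    n = count_parameters(model)
    if n > limit:
        raise ValueError(f"network has {n} parameters, exceeding the budget of {limit}")
```

Because the architecture dimensions are fixed, this check mainly guards against adding extra heads or layers inside the editable classes.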
Evaluation
Agents are trained and evaluated on CartPole-v1, LunarLander-v2, and Acrobot-v1. Additional held-out environments (not shown during intermediate testing) assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
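The metric can be reproduced with a simple greedy rollout loop. A sketch assuming `policy(obs) -> action` and a `make_env()` factory that follows the Gymnasium `reset`/`step` API (both names are illustrative, not the harness's interface):

```python
import numpy as np

def evaluate(policy, make_env, n_episodes=10):
    # Mean episodic return over n_episodes greedy rollouts.
    returns = []
    for ep in range(n_episodes):
        env = make_env()
        obs, _ = env.reset(seed=ep)
        done, ep_return = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    return float(np.mean(returns))
```

Note that truncation (e.g. CartPole-v1's 500-step cap) ends the episode without counting as failure, which is why 500.0 is the ceiling score on that environment.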
Code
```python
# Custom value-based discrete RL algorithm for MLS-Bench
#
# EDITABLE section: QNetwork head and ValueAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, encoder, utility, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
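Only the listing's header is shown above. As a starting point, the editable classes might look like the following skeleton, a plain one-step TD update with an MSE loss. The class names come from the listing's comments, but every constructor and method signature here is an assumption about the harness, and the feature/action dimensions are placeholders for the fixed ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetworkHead(nn.Module):
    # Illustrative head mapping encoder features to per-action Q-values.
    # feature_dim and n_actions are placeholders; the real dimensions are FIXED.
    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.out = nn.Linear(feature_dim, n_actions)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.out(features)

class ValueAlgorithm:
    # Sketch of the editable algorithm class: vanilla DQN-style TD(0) update.
    # The update signature is an assumption, not the harness's actual interface.
    def __init__(self, q_net, target_net, optimizer, gamma=0.99):
        self.q_net, self.target_net = q_net, target_net
        self.optimizer, self.gamma = optimizer, gamma

    def update(self, obs, actions, rewards, next_obs, dones):
        with torch.no_grad():
            # Bootstrap target from the (periodically synced) target network.
            target = rewards + self.gamma * (1.0 - dones) * self.target_net(next_obs).max(1).values
        q = self.q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q, target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```

Algorithmic innovation would go into `update` (new loss functions, update rules) and the head's output parameterization, while keeping the fixed architecture dimensions intact.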
Results
| Model | Type | CartPole-v1 eval return ↑ | LunarLander-v2 eval return ↑ | Acrobot-v1 eval return ↑ |
|---|---|---|---|---|
| c51 | baseline | 495.133 | 233.630 | -83.867 |
| qr_dqn | baseline | 500.000 | 197.147 | -80.067 |
| anthropic/claude-opus-4.6 | vanilla | 9.333 | 119.117 | -80.000 |
| google/gemini-3.1-pro-preview | vanilla | 9.367 | 185.957 | -83.733 |
| gpt-5.4-pro | vanilla | 9.333 | 221.233 | -85.833 |
| anthropic/claude-opus-4.6 | agent | 500.000 | 88.333 | -82.833 |
| deepseek-reasoner | agent | 500.000 | -487.647 | -500.000 |
| google/gemini-3.1-pro-preview | agent | 9.367 | 185.957 | -83.733 |
| gpt-5.4-pro | agent | 9.333 | 221.233 | -85.833 |
| qwen3.6-plus | agent | 372.733 | 113.650 | -83.633 |