rl-value-discrete

Reinforcement Learning, cleanrl, rigorous codebase

Description

Online RL: Value-Based Methods for Discrete Control

Objective

Design and implement a value-based RL algorithm for discrete action spaces. Your code goes in custom_value_discrete.py. Three reference implementations (DQN, DoubleDQN, C51) are provided as read-only.

Background

Value-based methods estimate Q-values Q(s,a) for each state-action pair and derive a policy by selecting actions with the highest Q-value. Key challenges include overestimation bias, sample efficiency, and representing uncertainty. Different approaches address these through double Q-learning, distributional value functions, or prioritized replay.
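As a rough illustration of the overestimation issue mentioned above, the sketch below contrasts the standard DQN bootstrap target with the double Q-learning target. It uses NumPy arrays in place of network outputs; the function names, the `gamma` default, and the toy batch are assumptions made for this example and are not taken from the reference implementations.

```python
import numpy as np


def dqn_target(reward, done, next_q_target, gamma=0.99):
    """Standard DQN target: r + gamma * max_a Q_target(s', a)."""
    max_next = next_q_target.max(axis=1)
    return reward + gamma * (1.0 - done) * max_next


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target: the online net picks the action, the target net scores it."""
    best_action = next_q_online.argmax(axis=1)
    chosen = next_q_target[np.arange(len(best_action)), best_action]
    return reward + gamma * (1.0 - done) * chosen


# Toy batch (hypothetical values): 2 transitions, 3 actions.
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])  # second transition is terminal
next_q_online = np.array([[0.2, 0.9, 0.1], [0.5, 0.4, 0.3]])
next_q_target = np.array([[1.0, 0.5, 2.0], [0.1, 0.2, 0.3]])

y_dqn = dqn_target(reward, done, next_q_target, gamma=0.5)
y_ddqn = double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.5)
```

Because the double Q-learning target evaluates an action that is not necessarily the target network's own maximizer, it can never exceed the standard target for the same transition, which is the mechanism that dampens overestimation.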

Constraints

  • Network architecture dimensions are FIXED and cannot be modified
  • Total parameter count is enforced at runtime
  • Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
  • Do NOT simply copy a reference implementation with minor changes
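The task does not show how the parameter count is enforced, so the following is only a sketch of what such a check could look like for a plain fully connected network. The helper names and the hidden widths (120 and 84) are hypothetical; only the CartPole-v1 input and output sizes (4 observations, 2 actions) are known properties of that environment.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected net with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def check_param_budget(layer_sizes, expected):
    """Raise if the network size differs from the expected parameter count."""
    actual = mlp_param_count(layer_sizes)
    if actual != expected:
        raise ValueError(f"parameter count {actual} != expected {expected}")
    return actual


# Hypothetical architecture: 4 inputs, hidden widths 120 and 84, 2 action outputs.
count = check_param_budget([4, 120, 84, 2], expected=10934)
```

Under a check of this kind, adding an extra output head or widening a layer would change the count and fail, which is consistent with the instruction to innovate in the loss, update rule, or exploration strategy instead.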

Evaluation

Algorithms are trained and evaluated on CartPole-v1, LunarLander-v2, and Acrobot-v1. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
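The metric can be sketched as a short evaluation loop. This is a minimal version assuming a Gymnasium-style environment interface (`reset` returning `(obs, info)`, `step` returning a 5-tuple); the `evaluate` function, its arguments, and the per-episode seeding are assumptions for illustration, not the benchmark's actual evaluation code.

```python
from statistics import mean


def evaluate(make_env, select_action, n_episodes=10, seed=0):
    """Return the mean undiscounted episodic return over n_episodes rollouts."""
    returns = []
    for episode in range(n_episodes):
        env = make_env()
        obs, _info = env.reset(seed=seed + episode)
        done, total = False, 0.0
        while not done:
            action = select_action(obs)
            obs, reward, terminated, truncated, _info = env.step(action)
            total += float(reward)
            # An episode ends on either a terminal state or a time limit.
            done = terminated or truncated
        env.close()
        returns.append(total)
    return mean(returns)
```

With Gymnasium installed, `make_env` could be `lambda: gym.make("CartPole-v1")` and `select_action` a function that returns the argmax of the trained Q-network's output for the given observation.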

Code

custom_value_discrete.py
# Custom value-based discrete RL algorithm for MLS-Bench
#
# EDITABLE section: QNetwork head and ValueAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, encoder, utility, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Results

| Model | Type | Eval return (CartPole-v1) | Eval return (LunarLander-v2) | Eval return (Acrobot-v1) |
|---|---|---|---|---|
| c51 | baseline | 495.133 | 233.630 | -83.867 |
| qr_dqn | baseline | 500.000 | 197.147 | -80.067 |
| anthropic/claude-opus-4.6 | vanilla | 9.333 | 119.117 | -80.000 |
| google/gemini-3.1-pro-preview | vanilla | 9.367 | 185.957 | -83.733 |
| gpt-5.4-pro | vanilla | 9.333 | 221.233 | -85.833 |
| anthropic/claude-opus-4.6 | agent | 500.000 | 88.333 | -82.833 |
| deepseek-reasoner | agent | 500.000 | -487.647 | -500.000 |
| google/gemini-3.1-pro-preview | agent | 9.367 | 185.957 | -83.733 |
| gpt-5.4-pro | agent | 9.333 | 221.233 | -85.833 |
| qwen3.6-plus | agent | 372.733 | 113.650 | -83.633 |

Agent Conversations