rl-value-discrete
Description
Online RL: Value-Based Methods for Discrete Control
Objective
Design and implement a value-based RL algorithm for discrete action spaces. Your code goes in `custom_value_discrete.py`. Three reference implementations (DQN, DoubleDQN, C51) are provided as read-only.
Background
Value-based methods estimate Q-values Q(s,a) for each state-action pair and derive a policy by selecting actions with the highest Q-value. Key challenges include overestimation bias, sample efficiency, and representing uncertainty. Different approaches address these through double Q-learning, distributional value functions, or prioritized replay.
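For concreteness, the overestimation issue comes from bootstrapping off a max over noisy Q-estimates; double Q-learning decouples action *selection* (online network) from action *evaluation* (target network). A minimal sketch of the two bootstrap targets (tensor names and shapes are illustrative, not the harness's interface):

```python
import torch

def dqn_target(rewards, dones, next_q_target, gamma=0.99):
    # Standard DQN: bootstrap from the max of the target network's Q-values
    # (max over noisy estimates is biased upward).
    max_next_q = next_q_target.max(dim=1).values
    return rewards + gamma * (1.0 - dones) * max_next_q

def double_dqn_target(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online network picks the action, the target network scores it.
    best_actions = next_q_online.argmax(dim=1, keepdim=True)
    eval_q = next_q_target.gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * eval_q
```

When the online network overrates an action that the target network does not, the double-DQN target uses the lower, decoupled estimate, which is the source of its reduced bias.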
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
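The task statement does not show the runtime parameter check; a plausible sketch of how such a budget could be enforced (the limit value here is hypothetical, the real one is set by the harness):

```python
import torch.nn as nn

PARAM_LIMIT = 100_000  # hypothetical budget; the actual enforced limit comes from the harness

def count_parameters(model: nn.Module) -> int:
    # Total trainable parameters in the network.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def assert_within_budget(model: nn.Module, limit: int = PARAM_LIMIT) -> None:
    # Raise if the model exceeds the parameter budget.
    n = count_parameters(model)
    if n > limit:
        raise ValueError(f"network has {n} parameters, exceeding the budget of {limit}")
```

Because the architecture dimensions are fixed, this check mainly guards against adding extra heads or layers inside the editable classes.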
Evaluation
Agents are trained and evaluated on CartPole-v1, LunarLander-v2, and Acrobot-v1. Additional held-out environments (not shown during intermediate testing) assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
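The metric can be reproduced with a simple greedy rollout loop. A sketch assuming `policy(obs) -> action` and a `make_env()` factory that follows the Gymnasium `reset`/`step` API (both names are illustrative, not the harness's interface):

```python
import numpy as np

def evaluate(policy, make_env, n_episodes=10):
    # Mean episodic return over n_episodes greedy rollouts.
    returns = []
    for ep in range(n_episodes):
        env = make_env()
        obs, _ = env.reset(seed=ep)
        done, ep_return = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    return float(np.mean(returns))
```

Note that truncation (e.g. CartPole-v1's 500-step cap) ends the episode without counting as failure, which is why 500.0 is the ceiling score on that environment.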
Code
```python
# Custom value-based discrete RL algorithm for MLS-Bench
#
# EDITABLE section: QNetwork head and ValueAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, encoder, utility, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
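Only the listing's header is shown above. As a starting point, the editable classes might look like the following skeleton, a plain one-step TD update with an MSE loss. The class names come from the listing's comments, but every constructor and method signature here is an assumption about the harness, and the feature/action dimensions are placeholders for the fixed ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetworkHead(nn.Module):
    # Illustrative head mapping encoder features to per-action Q-values.
    # feature_dim and n_actions are placeholders; the real dimensions are FIXED.
    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.out = nn.Linear(feature_dim, n_actions)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.out(features)

class ValueAlgorithm:
    # Sketch of the editable algorithm class: vanilla DQN-style TD(0) update.
    # The update signature is an assumption, not the harness's actual interface.
    def __init__(self, q_net, target_net, optimizer, gamma=0.99):
        self.q_net, self.target_net = q_net, target_net
        self.optimizer, self.gamma = optimizer, gamma

    def update(self, obs, actions, rewards, next_obs, dones):
        with torch.no_grad():
            # Bootstrap target from the (periodically synced) target network.
            target = rewards + self.gamma * (1.0 - dones) * self.target_net(next_obs).max(1).values
        q = self.q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q, target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```

Algorithmic innovation would go into `update` (new loss functions, update rules) and the head's output parameterization, while keeping the fixed architecture dimensions intact.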
Results
| Model | Type | CartPole-v1 eval return ↑ | LunarLander-v2 eval return ↑ | Acrobot-v1 eval return ↑ |
|---|---|---|---|---|
| c51 | baseline | 495.133 | 233.630 | -83.867 |
| qr_dqn | baseline | 500.000 | 197.147 | -80.067 |
| anthropic/claude-opus-4.6 | vanilla | 9.333 | 119.117 | -80.000 |
| google/gemini-3.1-pro-preview | vanilla | 9.367 | 185.957 | -83.733 |
| gpt-5.4-pro | vanilla | 9.333 | 221.233 | -85.833 |
| anthropic/claude-opus-4.6 | agent | 500.000 | 88.333 | -82.833 |
| deepseek-reasoner | agent | 500.000 | -487.647 | -500.000 |
| google/gemini-3.1-pro-preview | agent | 9.367 | 185.957 | -83.733 |
| gpt-5.4-pro | agent | 9.333 | 221.233 | -85.833 |
| qwen3.6-plus | agent | 372.733 | 113.650 | -83.633 |