Agent Conversation: anthropic/claude-opus-4.6
Task: rl-value-discrete
Online RL: Value-Based Methods for Discrete Control
Objective
Design and implement a value-based RL algorithm for discrete action spaces. Your code goes in custom_value_discrete.py. Three reference implementations (DQN, DoubleDQN, C51) are provided as read-only.
Background
Value-based methods estimate Q-values Q(s,a) for each state-action pair and derive a policy by selecting actions with the highest Q-value. Key challenges include overestimation bias, sample efficiency, and representing uncertainty. Different approaches address these through double Q-learning, distributional value functions, or prioritized replay.
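As a concrete illustration of the double Q-learning idea mentioned above, here is a minimal sketch (not taken from the reference implementations; the function name `double_q_target` is illustrative) of how a Double DQN bootstrap target is typically formed in PyTorch: the online network selects the next action and the target network evaluates it, which dampens overestimation bias.

```python
import torch

def double_q_target(q_next_online, q_next_target, rewards, dones, gamma=0.99):
    # q_next_online / q_next_target: (batch, n_actions) Q-values for the next state
    # from the online and target networks; rewards / dones: (batch,) tensors.
    next_actions = q_next_online.argmax(dim=1, keepdim=True)   # online net picks a'
    next_q = q_next_target.gather(1, next_actions).squeeze(1)  # target net evaluates a'
    return rewards + gamma * (1.0 - dones) * next_q            # Bellman target
```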
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime (a quick local check is sketched after this list)
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
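A minimal sketch of how the parameter budget mentioned above could be checked before a run; `count_trainable_params`, the example dimensions, and `PARAM_BUDGET` are illustrative placeholders, since the limit the harness actually enforces is not shown in this transcript.

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    # Sum the element counts of all trainable tensors in the module.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical usage (PARAM_BUDGET is illustrative, not the enforced value):
# q_net = QNetwork(obs_dim=8, n_actions=4)
# assert count_trainable_params(q_net) <= PARAM_BUDGET
```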
Evaluation
Trained and evaluated on CartPole-v1, LunarLander-v2, Acrobot-v1. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
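The evaluation loop itself lives in the fixed part of the training script and is not reproduced in this transcript; the sketch below (the `evaluate` helper is hypothetical) only illustrates the metric, assuming the `select_action(obs, epsilon)` interface that `ValueAlgorithm` exposes later in the file.

```python
import gymnasium as gym
import numpy as np

def evaluate(agent, env_id="CartPole-v1", episodes=10):
    # Roll out `episodes` greedy episodes (epsilon=0) and average the returns.
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            action = agent.select_action(np.asarray(obs, dtype=np.float32), epsilon=0.0)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += float(reward)
            done = terminated or truncated
        returns.append(ep_return)
    return float(np.mean(returns))
```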
cleanrl/cleanrl/custom_value_discrete.py [EDITABLE — lines 174–242 only]
1: # Custom value-based discrete RL algorithm for MLS-Bench
2: #
3: # EDITABLE section: QNetwork head and ValueAlgorithm classes.
4: # FIXED sections: everything else (config, env, buffer, encoder, utility, training loop).
5: import os
6: import random
7: import time
8: from dataclasses import dataclass
9:
10: import gymnasium as gym
11: import numpy as np
12: import torch
13: import torch.nn as nn
14: import torch.nn.functional as F
15: import torch.optim as optim
16: import tyro
17:
18:
19: # =====================================================================
20: # FIXED: Configuration
21: # =====================================================================
22: @dataclass
23: class Args:
24:     exp_name: str = os.path.basename(__file__)[: -len(".py")]
25:     """the name of this experiment"""
26:     seed: int = 1
27:     """seed of the experiment"""
28:     torch_deterministic: bool = True
29:     """if toggled, `torch.backends.cudnn.deterministic=False`"""
30:     cuda: bool = True
31:     """if toggled, cuda will be enabled by default"""
32:
33:     # Algorithm specific arguments
34:     env_id: str = "CartPole-v1"
35:     """the id of the environment"""
36:     total_timesteps: int = 500000
37:     """total timesteps of the experiments"""
38:     learning_rate: float = 2.5e-4
39:     """the learning rate of the optimizer"""
40:     buffer_size: int = 10000
41:     """the replay memory buffer size"""
42:     gamma: float = 0.99
43:     """the discount factor gamma"""

Tool Input
1{
2 "error": "Failed to parse tool arguments: Expecting ',' delimiter: line 1 column 109 (char 108)"
3}

ERROR: Failed to parse tool arguments: Expecting ',' delimiter: line 1 column 109 (char 108)
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_value_discrete.py",
4 "start_line": 174,
5 "end_line": 242,
6 "content": "class QNetwork(nn.Module):\n \"\"\"Q-network: MLPEncoder (fixed) + head. Output: Q-values per action.\"\"\"\n\n def __init__(self, obs_dim, n_actions):\n super().__init__()\n self.encoder = MLPEncoder(obs_dim)\n self.head = nn.Linear(ENCODER_FEATURE_DIM, n_actions)\n\n def forward(self, obs):\n features = self.encoder(obs)\n return self.head(features)\n\n\nclass ValueAlgorithm:\n \"\"\"Munchausen Double DQN: entropy-augmented bootstrapping with double\n Q-learning. Adds a scaled log-policy bonus to rewards and uses soft\n value targets, providing implicit KL regularization. Uses the online\n network's policy for next-state action ranking (double Q-learning)\n while evaluating with the target network's Q-values.\n \"\"\"\n\n def __init__(self, obs_dim, n_actions, device, args):\n self.device = device\n self.n_actions = n_actions\n self.gamma = args.gamma\n self.total_it = 0\n self.m_alpha = 0.9 # Munchausen scaling factor\n self.m_tau = 0.03 # entropy temperature\n self.lo = -1.0 # log-policy clipping lower bound\n self.q_network = QNetwork(obs_dim, n_actions).to(device)\n self.target_network = QNetwork(obs_dim, n_actions).to(device)\n self.target_network.load_state_dict(self.q_network.state_dict())\n self.optimizer = optim.Adam(self.q_network.parameters(), lr=args.learning_rate)\n\n def select_action(self, obs, epsilon):\n if random.random() < epsilon:\n return random.randint(0, self.n_actions - 1)\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n q_values = self.q_network(obs_t)\n return torch.argmax(q_values, dim=1).item()\n\n def update(self, batch, global_step):\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n with torch.no_grad():\n # Double Q: use online net's policy, target net's Q-values\n q_next_online = self.q_network(next_obs)\n q_next_target = self.target_network(next_obs)\n # Soft policy from online network for next state\n log_pi_next = F.log_softmax(q_next_online / self.m_tau, dim=1)\n pi_next = log_pi_next.exp()\n # Soft value using target Q-values but online policy (Double)\n v_next = (pi_next * (q_next_target - self.m_tau * log_pi_next)).sum(1)\n # Munchausen bonus: log-policy for current action (target net)\n q_curr_tgt = self.target_network(obs)\n lp = F.log_softmax(q_curr_tgt / self.m_tau, dim=1)\n lp_a = lp[torch.arange(len(obs)), actions].clamp(self.lo, 0)\n # Augmented Bellman target\n td_target = rewards + self.m_alpha * self.m_tau * lp_a + \\\n self.gamma * (1 - dones) * v_next\n old_val = self.q_network(obs).gather(1, actions.unsqueeze(1)).squeeze(1)\n loss = F.smooth_l1_loss(old_val, td_target)\n self.optimizer.zero_grad()\n loss.backward()\n torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=10.0)\n self.optimizer.step()\n return {\"td_loss\": loss.item(), \"q_values\": old_val.mean().item()}"
7}

OK: Replaced lines 174..242 with 67 line(s) in cleanrl/cleanrl/custom_value_discrete.py. Editable range: 174–240.
[Current file: cleanrl/cleanrl/custom_value_discrete.py | editable: 174–240 | total: 327 lines]
171| # =====================================================================
172| # EDITABLE: QNetwork head and ValueAlgorithm
173| # =====================================================================
174 class QNetwork(nn.Module):
175     """Q-network: MLPEncoder (fixed) + head. Output: Q-values per action."""
176
177     def __init__(self, obs_dim, n_actions):
178         super().__init__()
179         self.encoder = MLPEncoder(obs_dim)
180         self.head = nn.Linear(ENCODER_FEATURE_DIM, n_actions)
181
182     def forward(self, obs):
183         features = self.encoder(obs)
184         return self.head(features)
185
186
187 class ValueAlgorithm:
188     """Munchausen Double DQN: entropy-augmented bo
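For readability (the code above is escaped inside the JSON string), the target computed in this first attempt's `update` boils down to the following sketch; the function name `munchausen_double_target` is illustrative, while the shapes and hyperparameters match the tool call.

```python
import torch
import torch.nn.functional as F

def munchausen_double_target(q_next_online, q_next_target, q_curr_target,
                             actions, rewards, dones,
                             gamma=0.99, m_alpha=0.9, m_tau=0.03, lo=-1.0):
    # Soft policy over next-state actions, taken from the online network.
    log_pi_next = F.log_softmax(q_next_online / m_tau, dim=1)
    pi_next = log_pi_next.exp()
    # Soft value: target-network Q-values weighted by the online policy.
    v_next = (pi_next * (q_next_target - m_tau * log_pi_next)).sum(1)
    # Munchausen bonus: clipped log-probability of the action actually taken.
    log_pi = F.log_softmax(q_curr_target / m_tau, dim=1)
    lp_a = log_pi[torch.arange(len(actions)), actions].clamp(lo, 0)
    return rewards + m_alpha * m_tau * lp_a + gamma * (1 - dones) * v_next
```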
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

## Seed 42

### cartpole-v1 (scripts/cartpole.sh)
Training metrics (last steps):
TRAIN_METRICS step=495000 td_loss=8831448.0000 q_values=26819108.0000
TRAIN_METRICS step=496000 td_loss=9037152.0000 q_values=27934608.0000
TRAIN_METRICS step=497000 td_loss=9948830.0000 q_values=29775726.0000
TRAIN_METRICS step=498000 td_loss=10187424.0000 q_values=28719050.0000
TRAIN_METRICS step=499000 td_loss=10196548.0000 q_values=29632152.0000
Evaluation (cartpole-v1):
Eval episodic_return: 9.10
Eval episodic_return: 9.10
Eval episodic_return: 9.10
Final eval return: 9.10

### lunarlander-v2 (scripts/lunarlander.sh)
Training metrics (last steps):
TRAIN_METRICS step=495000 td_loss=0.3966 q_values=13.1679
TRAIN_METRICS step=496000 td_loss=0.5025 q_values=12.6280
TRAIN_METRICS step=497000 td_loss=0.2204 q_values=14.8049
TRAIN_METRICS step=498000 td_loss=0.4568 q_values=12.3258
TRAIN_METRICS step=499000 td_loss=0.8418 q_values=10.669
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_value_discrete.py",
4 "start_line": 174,
5 "end_line": 240,
6 "content": "class QNetwork(nn.Module):\n \"\"\"Q-network: MLPEncoder (fixed) + head. Output: Q-values per action.\"\"\"\n\n def __init__(self, obs_dim, n_actions):\n super().__init__()\n self.encoder = MLPEncoder(obs_dim)\n self.head = nn.Linear(ENCODER_FEATURE_DIM, n_actions)\n\n def forward(self, obs):\n features = self.encoder(obs)\n return self.head(features)\n\n\nclass ValueAlgorithm:\n \"\"\"Pessimistic Munchausen Double DQN (PM-DDQN).\n Novel combination of: (1) Double Q-learning for action selection,\n (2) Munchausen reward augmentation adding scaled log-policy bonus for\n implicit KL regularization, and (3) asymmetric (expectile) Huber loss\n that penalizes Q-value overestimation more than underestimation,\n pushing learned values toward a pessimistic expectile of the target.\n \"\"\"\n\n def __init__(self, obs_dim, n_actions, device, args):\n self.device = device\n self.n_actions = n_actions\n self.gamma = args.gamma\n self.total_it = 0\n self.m_alpha = 0.9 # Munchausen scaling\n self.m_tau = 0.03 # entropy temperature for log-policy\n self.lo = -1.0 # log-policy clipping bound\n self.asym_tau = 0.6 # expectile: >0.5 penalizes overestimation more\n self.q_network = QNetwork(obs_dim, n_actions).to(device)\n self.target_network = QNetwork(obs_dim, n_actions).to(device)\n self.target_network.load_state_dict(self.q_network.state_dict())\n self.optimizer = optim.Adam(self.q_network.parameters(), lr=args.learning_rate)\n\n def select_action(self, obs, epsilon):\n if random.random() < epsilon:\n return random.randint(0, self.n_actions - 1)\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n q_values = self.q_network(obs_t)\n return torch.argmax(q_values, dim=1).item()\n\n def update(self, batch, global_step):\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n with torch.no_grad():\n # Double DQN: online net selects action, target net evaluates\n next_act = self.q_network(next_obs).argmax(dim=1)\n tgt_q = self.target_network(next_obs)\n next_q = tgt_q[torch.arange(len(next_obs)), next_act]\n # Munchausen reward augmentation via target net log-policy\n q_s = self.target_network(obs)\n lp = F.log_softmax(q_s / self.m_tau, dim=1)\n lp_a = lp[torch.arange(len(obs)), actions].clamp(self.lo, 0)\n aug_r = rewards + self.m_alpha * self.m_tau * lp_a\n td_target = aug_r + self.gamma * (1 - dones) * next_q\n old_val = self.q_network(obs).gather(1, actions.unsqueeze(1)).squeeze(1)\n # Asymmetric Huber loss (expectile): penalizes overestimation more\n delta = old_val - td_target\n abs_d = torch.abs(delta)\n huber = torch.where(abs_d <= 1.0, 0.5 * delta ** 2, abs_d - 0.5)\n w = torch.where(delta.detach() > 0, self.asym_tau, 1.0 - self.asym_tau)\n loss = (w * huber).mean()\n self.optimizer.zero_grad()\n loss.backward()\n torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=10.0)\n self.optimizer.step()\n return {\"td_loss\": loss.item(), \"q_values\": old_val.mean().item()}"
7}

OK: Replaced lines 174..240 with 69 line(s) in cleanrl/cleanrl/custom_value_discrete.py. Editable range: 174–242.
[Current file: cleanrl/cleanrl/custom_value_discrete.py | editable: 174–242 | total: 329 lines]
171| # =====================================================================
172| # EDITABLE: QNetwork head and ValueAlgorithm
173| # =====================================================================
174 class QNetwork(nn.Module):
175     """Q-network: MLPEncoder (fixed) + head. Output: Q-values per action."""
176
177     def __init__(self, obs_dim, n_actions):
178         super().__init__()
179         self.encoder = MLPEncoder(obs_dim)
180         self.head = nn.Linear(ENCODER_FEATURE_DIM, n_actions)
181
182     def forward(self, obs):
183         features = self.encoder(obs)
184         return self.head(features)
185
186
187 class ValueAlgorithm:
188     """Pessimistic Munchausen Double DQN (PM-DDQN)
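The distinguishing piece of this second attempt is the asymmetric (expectile-weighted) Huber loss; decoded from the escaped content above, it is equivalent to the sketch below (the name `asymmetric_huber_loss` is illustrative).

```python
import torch

def asymmetric_huber_loss(pred, target, expectile=0.6):
    # Expectile > 0.5 weights positive TD errors (overestimates) more heavily.
    delta = pred - target
    abs_d = delta.abs()
    huber = torch.where(abs_d <= 1.0, 0.5 * delta ** 2, abs_d - 0.5)
    weight = torch.where(delta.detach() > 0,
                         torch.full_like(delta, expectile),
                         torch.full_like(delta, 1.0 - expectile))
    return (weight * huber).mean()
```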
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### cartpole-v1 (scripts/cartpole.sh)
Training metrics (last steps):
TRAIN_METRICS step=495000 td_loss=176.5851 q_values=4600.3374
TRAIN_METRICS step=496000 td_loss=222.0134 q_values=4583.3789
TRAIN_METRICS step=497000 td_loss=234.0273 q_values=4714.8135
TRAIN_METRICS step=498000 td_loss=242.9607 q_values=4729.9619
TRAIN_METRICS step=499000 td_loss=122.4463 q_values=4809.4980
Evaluation (cartpole-v1):
Eval episodic_return: 21.90
Eval episodic_return: 19.50
Eval episodic_return: 23.50
Final eval return: 23.50

### lunarlander-v2 (scripts/lunarlander.sh)
Training metrics (last steps):
TRAIN_METRICS step=495000 td_loss=0.0981 q_values=-5.7713
TRAIN_METRICS step=496000 td_loss=0.2421 q_values=-9.3700
TRAIN_METRICS step=497000 td_loss=0.1250 q_values=-9.0795
TRAIN_METRICS step=498000 td_loss=0.0987 q_values=-7.6945
TRAIN_METRICS step=499000 td_loss=0.3228 q_values=-6.7049
Evaluation (lunarlander-v2):
Eval episodic_return:
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_value_discrete.py",
4 "start_line": 174,
5 "end_line": 242,
6 "content": "class QNetwork(nn.Module):\n \"\"\"Q-network: MLPEncoder (fixed) + head. Output: Q-values per action.\"\"\"\n\n def __init__(self, obs_dim, n_actions):\n super().__init__()\n self.encoder = MLPEncoder(obs_dim)\n self.head = nn.Linear(ENCODER_FEATURE_DIM, n_actions)\n\n def forward(self, obs):\n features = self.encoder(obs)\n return self.head(features)\n\n\nclass ValueAlgorithm:\n \"\"\"Ensembled Munchausen DQN (EM-DQN). Novel combination of:\n (1) Munchausen reward augmentation: adds scaled log-policy bonus from\n the target network, providing implicit KL regularization that penalizes\n actions inconsistent with the current policy.\n (2) Ensemble action selection: averages online and target Q-networks\n for training data collection, yielding more robust exploration by\n reducing the impact of any single network's overestimation.\n (3) MSE loss for proper gradient scaling on large TD errors.\n \"\"\"\n\n def __init__(self, obs_dim, n_actions, device, args):\n self.device = device\n self.n_actions = n_actions\n self.gamma = args.gamma\n self.total_it = 0\n self.m_alpha = 0.9 # Munchausen scaling factor\n self.m_tau = 0.03 # entropy temperature for log-policy\n self.lo = -1.0 # log-policy clipping lower bound\n self.q_network = QNetwork(obs_dim, n_actions).to(device)\n self.target_network = QNetwork(obs_dim, n_actions).to(device)\n self.target_network.load_state_dict(self.q_network.state_dict())\n self.optimizer = optim.Adam(self.q_network.parameters(), lr=args.learning_rate)\n\n def select_action(self, obs, epsilon):\n \"\"\"Ensemble epsilon-greedy: average online+target Q for selection.\"\"\"\n if random.random() < epsilon:\n return random.randint(0, self.n_actions - 1)\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n with torch.no_grad():\n q_on = self.q_network(obs_t)\n q_tg = self.target_network(obs_t)\n q_ens = 0.5 * q_on + 0.5 * q_tg\n return torch.argmax(q_ens, dim=1).item()\n\n def update(self, batch, global_step):\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n with torch.no_grad():\n # Standard max target from target network (proven stable)\n target_max = self.target_network(next_obs).max(dim=1)[0]\n # Munchausen reward augmentation via target net's log-policy\n q_s = self.target_network(obs)\n lp = F.log_softmax(q_s / self.m_tau, dim=1)\n lp_a = lp[torch.arange(len(obs)), actions].clamp(self.lo, 0)\n aug_r = rewards + self.m_alpha * self.m_tau * lp_a\n td_target = aug_r + self.gamma * (1 - dones) * target_max\n old_val = self.q_network(obs).gather(1, actions.unsqueeze(1)).squeeze(1)\n loss = F.mse_loss(old_val, td_target)\n self.optimizer.zero_grad()\n loss.backward()\n torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=10.0)\n self.optimizer.step()\n return {\"td_loss\": loss.item(), \"q_values\": old_val.mean().item()}"
7}

OK: Replaced lines 174..242 with 67 line(s) in cleanrl/cleanrl/custom_value_discrete.py. Editable range: 174–240.
[Current file: cleanrl/cleanrl/custom_value_discrete.py | editable: 174–240 | total: 327 lines]
171| # =====================================================================
172| # EDITABLE: QNetwork head and ValueAlgorithm
173| # =====================================================================
174 class QNetwork(nn.Module):
175     """Q-network: MLPEncoder (fixed) + head. Output: Q-values per action."""
176
177     def __init__(self, obs_dim, n_actions):
178         super().__init__()
179         self.encoder = MLPEncoder(obs_dim)
180         self.head = nn.Linear(ENCODER_FEATURE_DIM, n_actions)
181
182     def forward(self, obs):
183         features = self.encoder(obs)
184         return self.head(features)
185
186
187 class ValueAlgorithm:
188     """Ensembled Munchausen DQN (EM-DQN). Novel co
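The ensemble action selection described in this third attempt's docstring (the part that differs from the previous two submissions) reduces, in sketch form, to averaging the two Q-heads before the argmax; the helper name `ensemble_greedy_action` is illustrative.

```python
import torch

def ensemble_greedy_action(q_network, target_network, obs_t):
    # Average online and target Q-values, then act greedily on the mixture.
    with torch.no_grad():
        q_ens = 0.5 * q_network(obs_t) + 0.5 * target_network(obs_t)
    return int(q_ens.argmax(dim=1).item())
```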
[Test #3] (0 tests remaining)

## Seed 42

### cartpole-v1 (scripts/cartpole.sh)
Training metrics (last steps):
TRAIN_METRICS step=495000 td_loss=0.0164 q_values=101.7867
TRAIN_METRICS step=496000 td_loss=0.0138 q_values=101.6135
TRAIN_METRICS step=497000 td_loss=0.0096 q_values=101.6237
TRAIN_METRICS step=498000 td_loss=0.0074 q_values=101.6280
TRAIN_METRICS step=499000 td_loss=0.0139 q_values=101.9030
Evaluation (cartpole-v1):
Eval episodic_return: 500.00
Eval episodic_return: 500.00
Eval episodic_return: 500.00
Final eval return: 500.00

### lunarlander-v2 (scripts/lunarlander.sh)
Training metrics (last steps):
TRAIN_METRICS step=495000 td_loss=0.6554 q_values=27.0310
TRAIN_METRICS step=496000 td_loss=1.4189 q_values=26.3545
TRAIN_METRICS step=497000 td_loss=1.8950 q_values=31.8224
TRAIN_METRICS step=498000 td_loss=2.4778 q_values=24.1690
TRAIN_METRICS step=499000 td_loss=1.1699 q_values=29.6376
Evaluation (lunarlander-v2):
Eval episodic_return: 186.41
Eval episodic_return: 186.76