Agent Conversation: gpt-5.4-pro
Task: rl-offpolicy-continuous
Online RL: Off-Policy Actor-Critic for Continuous Control
Objective
Design and implement an off-policy actor-critic RL algorithm for continuous control. Your code goes in custom_offpolicy_continuous.py. Three reference implementations (DDPG, TD3, SAC) are provided as read-only.
Background
Off-policy methods maintain a replay buffer of past experience and update the policy using data collected under previous policies. Key challenges include overestimation bias in Q-value estimates, the exploration-exploitation tradeoff, and sample efficiency. Different approaches address these through twin critics, entropy regularization, target smoothing, or delayed policy updates.
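Two of these mechanisms are compact enough to show inline. The sketch below is a minimal PyTorch illustration, not the harness's actual code: it shows the clipped double-Q target (twin critics) and the Polyak soft update (target smoothing); the function names, signatures, and default coefficients are assumptions for illustration.

```python
import torch
import torch.nn as nn

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    # Polyak averaging: the target network trails the online network,
    # which stabilizes the bootstrapped regression targets.
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

def clipped_double_q_target(qf1_target, qf2_target, next_obs, next_actions,
                            rewards, dones, gamma: float = 0.99) -> torch.Tensor:
    # Twin critics: bootstrapping from the pointwise minimum of two target
    # critics damps the overestimation bias a single critic accumulates.
    with torch.no_grad():
        q1 = qf1_target(next_obs, next_actions).view(-1)
        q2 = qf2_target(next_obs, next_actions).view(-1)
        return rewards + (1.0 - dones) * gamma * torch.min(q1, q2)
```

Entropy regularization and delayed policy updates both appear later in this log, in the agent's SAC and TD3 variants.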
Evaluation
Agents are trained and evaluated on three MuJoCo locomotion environments: HalfCheetah-v4, Hopper-v4, and Walker2d-v4. Metric: mean episodic return over 10 evaluation episodes (higher is better).
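The fixed evaluation loop is not reproduced in this log; the following is a plausible reconstruction under Gymnasium's standard API (the `evaluate` helper, its arguments, and the seeding scheme are assumptions, not the harness's actual entry point).

```python
import gymnasium as gym
import numpy as np

def evaluate(env_id: str, policy, n_episodes: int = 10, seed: int = 0) -> float:
    # Roll out n_episodes full episodes with a fixed policy and report the mean return.
    env = gym.make(env_id)
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            ep_return += float(reward)
            done = terminated or truncated
        returns.append(ep_return)
    return float(np.mean(returns))
```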
cleanrl/cleanrl/custom_offpolicy_continuous.py [EDITABLE — lines 142–235 only]
1: # Custom off-policy continuous RL algorithm for MLS-Bench
2: #
3: # EDITABLE section: Actor, QNetwork, and OffPolicyAlgorithm classes.
4: # FIXED sections: everything else (config, env, buffer, eval, training loop).
5: import os
6: import random
7: import time
8: from dataclasses import dataclass
9:
10: import gymnasium as gym
11: import numpy as np
12: import torch
13: import torch.nn as nn
14: import torch.nn.functional as F
15: import torch.optim as optim
16: import tyro
17:
18:
19: # =====================================================================
20: # FIXED: Configuration
21: # =====================================================================
22: @dataclass
23: class Args:
24: exp_name: str = os.path.basename(__file__)[: -len(".py")]
25: """the name of this experiment"""
26: seed: int = 1
27: """seed of the experiment"""
28: torch_deterministic: bool = True
29: """if toggled, `torch.backends.cudnn.deterministic=False`"""
30: cuda: bool = True
31: """if toggled, cuda will be enabled by default"""
32:
33: # Algorithm specific arguments
34: env_id: str = "HalfCheetah-v4"
35: """the id of the environment"""
36: total_timesteps: int = 1000000
37: """total timesteps of the experiments"""
38: learning_rate: float = 3e-4
39: """the learning rate of the optimizer"""
40: buffer_size: int = int(1e6)
41: """the replay memory buffer size"""
42: gamma: float = 0.99
43: """the discount factor gamma"""
44: tau: float = 0.005
45: """target smoothing coefficient (default: 0.005)"""
46: batch_size: int = 256
47: """the batch size of sample from the replay memory"""
48: learning_starts: int = 25000
49: """timestep to start learning"""
50: policy_frequency: int
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": "LOG_STD_MAX = 2\nLOG_STD_MIN = -5\n\n\nclass Actor(nn.Module):\n \"\"\"Stochastic tanh-Gaussian actor with deterministic evaluation.\"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action):\n super().__init__()\n self.fc1 = nn.Linear(obs_dim, 256)\n self.fc2 = nn.Linear(256, 256)\n self.fc_mean = nn.Linear(256, action_dim)\n self.fc_logstd = nn.Linear(256, action_dim)\n self.register_buffer(\"action_scale\", torch.tensor(max_action, dtype=torch.float32))\n\n def _dist(self, obs):\n x = F.relu(self.fc1(obs))\n x = F.relu(self.fc2(x))\n mean = self.fc_mean(x)\n log_std = torch.tanh(self.fc_logstd(x))\n log_std = LOG_STD_MIN + 0.5 * (LOG_STD_MAX - LOG_STD_MIN) * (log_std + 1)\n return mean, log_std\n\n def forward(self, obs):\n mean, _ = self._dist(obs)\n return torch.tanh(mean) * self.action_scale\n\n def sample(self, obs):\n mean, log_std = self._dist(obs)\n std = log_std.exp()\n normal = torch.distributions.Normal(mean, std)\n x_t = normal.rsample()\n y_t = torch.tanh(x_t)\n action = y_t * self.action_scale\n log_prob = normal.log_prob(x_t)\n log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + 1e-6)\n log_prob = log_prob.sum(1, keepdim=True)\n mean_action = torch.tanh(mean) * self.action_scale\n return action, log_prob, mean_action\n\n @torch.no_grad()\n def get_action(self, obs):\n return self.forward(obs)\n\n\nclass QNetwork(nn.Module):\n \"\"\"Q-function Q(s, a) -> scalar.\"\"\"\n\n def __init__(self, obs_dim, action_dim):\n super().__init__()\n self.fc1 = nn.Linear(obs_dim + action_dim, 256)\n self.fc2 = nn.Linear(256, 256)\n self.fc3 = nn.Linear(256, 1)\n\n def forward(self, obs, action):\n x = torch.cat([obs, action], dim=-1)\n x = F.relu(self.fc1(x))\n x = F.relu(self.fc2(x))\n return self.fc3(x)\n\n\nclass OffPolicyAlgorithm:\n \"\"\"Improved SAC: twin critics, entropy tuning, deterministic eval policy.\"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action, device, args):\n self.device = device\n self.max_action = max_action\n self.gamma = args.gamma\n self.tau = args.tau\n self.policy_frequency = args.policy_frequency\n self.target_entropy = -float(action_dim)\n self.total_it = 0\n\n self.actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.qf1 = QNetwork(obs_dim, action_dim).to(device)\n self.qf2 = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf2_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target.load_state_dict(self.qf1.state_dict())\n self.qf2_target.load_state_dict(self.qf2.state_dict())\n\n self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=args.learning_rate)\n self.q_optimizer = optim.Adam(\n list(self.qf1.parameters()) + list(self.qf2.parameters()), lr=args.learning_rate\n )\n self.log_alpha = torch.zeros(1, device=device, requires_grad=True)\n self.alpha_optimizer = optim.Adam([self.log_alpha], lr=args.learning_rate)\n self.alpha = self.log_alpha.exp().item()\n\n def select_action(self, obs):\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n with torch.no_grad():\n action, _, _ = self.actor.sample(obs_t)\n return action.cpu().numpy().flatten()\n\n def update(self, batch):\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n\n with torch.no_grad():\n next_actions, next_log_pi, _ = self.actor.sample(next_obs)\n q1_next = self.qf1_target(next_obs, next_actions).view(-1)\n q2_next = self.qf2_target(next_obs, next_actions).view(-1)\n min_q_next = torch.min(q1_next, q2_next) - 
self.log_alpha.exp().detach() * next_log_pi.view(-1)\n td_target = rewards + (1 - dones) * self.gamma * min_q_next\n\n q1 = self.qf1(obs, actions).view(-1)\n q2 = self.qf2(obs, actions).view(-1)\n critic_loss = F.mse_loss(q1, td_target) + F.mse_loss(q2, td_target)\n\n self.q_optimizer.zero_grad(set_to_none=True)\n critic_loss.backward()\n torch.nn.utils.clip_grad_norm_(list(self.qf1.parameters()) + list(self.qf2.parameters()), 10.0)\n self.q_optimizer.step()\n\n actor_loss_val = 0.0\n if self.total_it % self.policy_frequency == 0:\n pi, log_pi, _ = self.actor.sample(obs)\n q1_pi = self.qf1(obs, pi).view(-1)\n q2_pi = self.qf2(obs, pi).view(-1)\n actor_loss = (self.log_alpha.exp().detach() * log_pi.view(-1) - torch.min(q1_pi, q2_pi)).mean()\n\n self.actor_optimizer.zero_grad(set_to_none=True)\n actor_loss.backward()\n torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 10.0)\n self.actor_optimizer.step()\n actor_loss_val = actor_loss.item()\n\n alpha_loss = (-self.log_alpha.exp() * (log_pi.detach().view(-1) + self.target_entropy)).mean()\n self.alpha_optimizer.zero_grad(set_to_none=True)\n alpha_loss.backward()\n self.alpha_optimizer.step()\n self.alpha = self.log_alpha.exp().item()\n\n soft_update(self.qf1_target, self.qf1, self.tau)\n soft_update(self.qf2_target, self.qf2, self.tau)\n return {\"critic_loss\": critic_loss.item(), \"actor_loss\": actor_loss_val, \"alpha\": self.alpha}",
5 "after_line": 235,
6 "start_line": 142,
7 "end_line": 235
8}OK: Replaced lines 142..235 with 137 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 142–278.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 142–278 | total: 357 lines]
139| # =====================================================================
140| # EDITABLE: Network definitions and OffPolicyAlgorithm
141| # =====================================================================
142 LOG_STD_MAX = 2
143 LOG_STD_MIN = -5
144
145
146 class Actor(nn.Module):
147 """Stochastic tanh-Gaussian actor with deterministic evaluation."""
148
149 def __init__(self, obs_dim, action_dim, max_action):
150 super().__init__()
151 self.fc1 = nn.Linear(obs_dim, 256)
152 self.fc2 = nn.Linear(256, 256)
153 self.fc_mean = nn.Linear(256, action_dim)
154 self.fc_logstd = nn.Linear(256, action_dim)
155 self.register_buffer("action_scal
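The entropy-temperature update buried in the JSON blob above is easier to read in isolation. This is a self-contained sketch of the same SAC-style automatic alpha tuning; the batch of log-probabilities and the example target entropy are stand-ins, not values from the run.

```python
import torch

target_entropy = -6.0                        # e.g. -action_dim on HalfCheetah-v4
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

log_pi = torch.randn(256, 1)                 # stand-in for sampled policy log-probs
# When policy entropy (-log_pi on average) drops below the target, the gradient
# step raises alpha and strengthens the entropy bonus; above target, it shrinks.
alpha_loss = (-log_alpha.exp() * (log_pi.detach() + target_entropy)).mean()
alpha_optimizer.zero_grad()
alpha_loss.backward()
alpha_optimizer.step()
alpha = log_alpha.exp().item()
```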
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": " self.policy_frequency = 1",
5 "after_line": 210,
6 "start_line": 211,
7 "end_line": 211
8}OK: Replaced lines 211..211 with 1 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 142–278.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 142–278 | total: 357 lines]
139| # =====================================================================
140| # EDITABLE: Network definitions and OffPolicyAlgorithm
141| # =====================================================================
142 LOG_STD_MAX = 2
143 LOG_STD_MIN = -5
144
145
146 class Actor(nn.Module):
147 """Stochastic tanh-Gaussian actor with deterministic evaluation."""
148
149 def __init__(self, obs_dim, action_dim, max_action):
150 super().__init__()
151 self.fc1 = nn.Linear(obs_dim, 256)
152 self.fc2 = nn.Linear(256, 256)
153 self.fc_mean = nn.Linear(256, action_dim)
154 self.fc_logstd = nn.Linear(256, action_dim)
155 self.register_buffer("action_scale"
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": " self.policy_frequency = 1\n self.exploration_noise = args.exploration_noise\n self.target_entropy = -0.5 * float(action_dim)",
5 "after_line": 210,
6 "start_line": 211,
7 "end_line": 213
8}OK: Replaced lines 211..213 with 3 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 142–278.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 142–278 | total: 357 lines]
139| # =====================================================================
140| # EDITABLE: Network definitions and OffPolicyAlgorithm
141| # =====================================================================
142 LOG_STD_MAX = 2
143 LOG_STD_MIN = -5
144
145
146 class Actor(nn.Module):
147 """Stochastic tanh-Gaussian actor with deterministic evaluation."""
148
149 def __init__(self, obs_dim, action_dim, max_action):
150 super().__init__()
151 self.fc1 = nn.Linear(obs_dim, 256)
152 self.fc2 = nn.Linear(256, 256)
153 self.fc_mean = nn.Linear(256, action_dim)
154 self.fc_logstd = nn.Linear(256, action_dim)
155 self.register_buffer("action_scale"
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": " self.policy_frequency = 1\n self.exploration_noise = args.exploration_noise\n self.target_entropy = -0.5 * float(action_dim)\n self.total_it = 0",
5 "after_line": 210,
6 "start_line": 211,
7 "end_line": 213
8}OK: Replaced lines 211..213 with 4 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 142–279.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 142–279 | total: 358 lines]
139| # =====================================================================
140| # EDITABLE: Network definitions and OffPolicyAlgorithm
141| # =====================================================================
142 LOG_STD_MAX = 2
143 LOG_STD_MIN = -5
144
145
146 class Actor(nn.Module):
147 """Stochastic tanh-Gaussian actor with deterministic evaluation."""
148
149 def __init__(self, obs_dim, action_dim, max_action):
150 super().__init__()
151 self.fc1 = nn.Linear(obs_dim, 256)
152 self.fc2 = nn.Linear(256, 256)
153 self.fc_mean = nn.Linear(256, action_dim)
154 self.fc_logstd = nn.Linear(256, action_dim)
155 self.register_buffer("action_scale"
Tool Input
1{
2 "is_final": false,
3 "use_last": false
4}
## Seed 42
### halfcheetah-v4 (scripts/halfcheetah.sh)
Training metrics (last steps):
TRAIN_METRICS step=995000 critic_loss=28.3550 actor_loss=-835.0787 alpha=0.4376
TRAIN_METRICS step=996000 critic_loss=34.5010 actor_loss=-813.7772 alpha=0.4426
TRAIN_METRICS step=997000 critic_loss=29.2728 actor_loss=-816.4824 alpha=0.4464
TRAIN_METRICS step=998000 critic_loss=46.1811 actor_loss=-808.1501 alpha=0.4414
TRAIN_METRICS step=999000 critic_loss=26.8785 actor_loss=-818.5698 alpha=0.4392
Evaluation (halfcheetah-v4):
Eval episodic_return: 11367.22
Eval episodic_return: 11663.22
Eval episodic_return: 11653.98
Final eval return: 11653.98
### hopper-v4 (scripts/hopper.sh)
Training metrics (last steps):
TRAIN_METRICS step=995000 critic_loss=41.8079 actor_loss=-331.6349 alpha=0.0508
TRAIN_METRICS step=996000 critic_loss=6.1763 actor_loss=-326.6221 alpha=0.0515
TRAIN_METRICS step=997000 critic_loss=2.3411 actor_loss=-325.3407 alpha=0.0524
TRAIN_METRICS step=998000 critic_loss=8.6375 actor_loss=-3
Tool Input
1{
2 "is_final": true,
3 "use_last": true
4}
## Seed 42
### halfcheetah-v4 (scripts/halfcheetah.sh)
Training metrics (last steps):
TRAIN_METRICS step=995000 critic_loss=28.3550 actor_loss=-835.0787 alpha=0.4376
TRAIN_METRICS step=996000 critic_loss=34.5010 actor_loss=-813.7772 alpha=0.4426
TRAIN_METRICS step=997000 critic_loss=29.2728 actor_loss=-816.4824 alpha=0.4464
TRAIN_METRICS step=998000 critic_loss=46.1811 actor_loss=-808.1501 alpha=0.4414
TRAIN_METRICS step=999000 critic_loss=26.8785 actor_loss=-818.5698 alpha=0.4392
Evaluation (halfcheetah-v4):
Eval episodic_return: 11367.22
Eval episodic_return: 11663.22
Eval episodic_return: 11653.98
Final eval return: 11653.98
### hopper-v4 (scripts/hopper.sh)
Training metrics (last steps):
TRAIN_METRICS step=995000 critic_loss=41.8079 actor_loss=-331.6349 alpha=0.0508
TRAIN_METRICS step=996000 critic_loss=6.1763 actor_loss=-326.6221 alpha=0.0515
TRAIN_METRICS step=997000 critic_loss=2.3411 actor_loss=-325.3407 alpha=0.0524
TRAIN_METRICS step=998000 critic_loss=8.6375 actor_loss=-3
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": "LOG_STD_MAX = 2.0\nLOG_STD_MIN = -5.0\n\n\ndef _init_linear(layer, gain=np.sqrt(2.0), bias=0.0):\n nn.init.orthogonal_(layer.weight, gain)\n nn.init.constant_(layer.bias, bias)\n return layer\n\n\nclass Actor(nn.Module):\n \"\"\"Stochastic tanh-Gaussian actor with deterministic eval actions.\"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action):\n super().__init__()\n self.fc1 = _init_linear(nn.Linear(obs_dim, 256))\n self.fc2 = _init_linear(nn.Linear(256, 256))\n self.fc_mean = _init_linear(nn.Linear(256, action_dim), gain=1e-2)\n self.fc_logstd = _init_linear(nn.Linear(256, action_dim), gain=1e-2, bias=-1.0)\n self.register_buffer(\"action_scale\", torch.tensor(max_action, dtype=torch.float32))\n\n def _dist_params(self, obs):\n x = F.relu(self.fc1(obs))\n x = F.relu(self.fc2(x))\n mean = self.fc_mean(x)\n log_std = torch.tanh(self.fc_logstd(x))\n log_std = LOG_STD_MIN + 0.5 * (LOG_STD_MAX - LOG_STD_MIN) * (log_std + 1.0)\n return mean, log_std\n\n def forward(self, obs):\n mean, _ = self._dist_params(obs)\n return torch.tanh(mean) * self.action_scale\n\n def sample(self, obs):\n mean, log_std = self._dist_params(obs)\n std = log_std.exp()\n normal = torch.distributions.Normal(mean, std)\n x_t = normal.rsample()\n y_t = torch.tanh(x_t)\n action = y_t * self.action_scale\n log_prob = normal.log_prob(x_t)\n log_prob -= torch.log(self.action_scale * (1.0 - y_t.pow(2)) + 1e-6)\n log_prob = log_prob.sum(dim=1, keepdim=True)\n mean_action = torch.tanh(mean) * self.action_scale\n return action, log_prob, mean_action\n\n @torch.no_grad()\n def get_action(self, obs):\n return self.forward(obs)\n\n\nclass QNetwork(nn.Module):\n \"\"\"Q-function Q(s, a) -> scalar.\"\"\"\n\n def __init__(self, obs_dim, action_dim):\n super().__init__()\n self.fc1 = _init_linear(nn.Linear(obs_dim + action_dim, 256))\n self.fc2 = _init_linear(nn.Linear(256, 256))\n self.fc3 = _init_linear(nn.Linear(256, 1), gain=1e-2)\n\n def forward(self, obs, action):\n x = torch.cat([obs, action], dim=-1)\n x = F.relu(self.fc1(x))\n x = F.relu(self.fc2(x))\n return self.fc3(x)\n\n\nclass OffPolicyAlgorithm:\n \"\"\"Engineered SAC implementation tuned for continuous control.\"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action, device, args):\n self.device = device\n self.max_action = max_action\n self.gamma = args.gamma\n self.tau = args.tau\n self.policy_frequency = 1\n self.grad_clip = 10.0\n self.total_it = 0\n\n self.actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.qf1 = QNetwork(obs_dim, action_dim).to(device)\n self.qf2 = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf2_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target.load_state_dict(self.qf1.state_dict())\n self.qf2_target.load_state_dict(self.qf2.state_dict())\n\n self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=args.learning_rate, eps=1e-5)\n self.q_optimizer = optim.Adam(\n list(self.qf1.parameters()) + list(self.qf2.parameters()),\n lr=args.learning_rate,\n eps=1e-5,\n )\n\n self.target_entropy = -0.75 * action_dim\n self.log_alpha = torch.tensor(np.log(0.1), device=device, requires_grad=True)\n self.alpha_optimizer = optim.Adam([self.log_alpha], lr=args.learning_rate, eps=1e-5)\n self.alpha = float(self.log_alpha.exp().item())\n\n def select_action(self, obs):\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n with torch.no_grad():\n action, _, _ = self.actor.sample(obs_t)\n return 
action.cpu().numpy().flatten()\n\n def update(self, batch):\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n rewards = rewards.view(-1)\n dones = dones.view(-1)\n\n alpha = self.log_alpha.exp().clamp(1e-4, 10.0)\n with torch.no_grad():\n next_actions, next_log_pi, _ = self.actor.sample(next_obs)\n q1_next = self.qf1_target(next_obs, next_actions).view(-1)\n q2_next = self.qf2_target(next_obs, next_actions).view(-1)\n min_q_next = torch.min(q1_next, q2_next) - alpha * next_log_pi.view(-1)\n td_target = rewards + (1.0 - dones) * self.gamma * min_q_next\n\n q1 = self.qf1(obs, actions).view(-1)\n q2 = self.qf2(obs, actions).view(-1)\n critic_loss = F.mse_loss(q1, td_target) + F.mse_loss(q2, td_target)\n\n self.q_optimizer.zero_grad(set_to_none=True)\n critic_loss.backward()\n torch.nn.utils.clip_grad_norm_(\n list(self.qf1.parameters()) + list(self.qf2.parameters()), self.grad_clip\n )\n self.q_optimizer.step()\n\n actor_loss_val = 0.0\n alpha_loss_val = 0.0\n if self.total_it % self.policy_frequency == 0:\n pi, log_pi, _ = self.actor.sample(obs)\n q1_pi = self.qf1(obs, pi).view(-1)\n q2_pi = self.qf2(obs, pi).view(-1)\n actor_loss = (alpha.detach() * log_pi.view(-1) - torch.min(q1_pi, q2_pi)).mean()\n\n self.actor_optimizer.zero_grad(set_to_none=True)\n actor_loss.backward()\n torch.nn.utils.clip_grad_norm_(self.actor.parameters(), self.grad_clip)\n self.actor_optimizer.step()\n actor_loss_val = actor_loss.item()\n\n alpha_loss = -(self.log_alpha.exp() * (log_pi.view(-1).detach() + self.target_entropy)).mean()\n self.alpha_optimizer.zero_grad(set_to_none=True)\n alpha_loss.backward()\n self.alpha_optimizer.step()\n self.alpha = float(self.log_alpha.exp().clamp(1e-4, 10.0).item())\n alpha_loss_val = alpha_loss.item()\n\n soft_update(self.qf1_target, self.qf1, self.tau)\n soft_update(self.qf2_target, self.qf2, self.tau)\n\n return {\n \"critic_loss\": critic_loss.item(),\n \"actor_loss\": actor_loss_val,\n \"alpha\": self.alpha,\n \"alpha_loss\": alpha_loss_val,\n }\n",
5 "after_line": 244,
6 "start_line": 153,
7 "end_line": 244
8}OK: Replaced lines 153..244 with 160 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 153–312.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 153–312 | total: 393 lines]
150| # =====================================================================
151| # EDITABLE: Network definitions and OffPolicyAlgorithm
152| # =====================================================================
153 LOG_STD_MAX = 2.0
154 LOG_STD_MIN = -5.0
155
156
157 def _init_linear(layer, gain=np.sqrt(2.0), bias=0.0):
158 nn.init.orthogonal_(layer.weight, gain)
159 nn.init.constant_(layer.bias, bias)
160 return layer
161
162
163 class Actor(nn.Module):
164 """Stochastic tanh-Gaussian actor with deterministic eval actions."""
165
166 def __init__(self, obs_dim, action_dim, max_action):
167 super().__init__()
168 self.fc1 = _init_linear(nn.Linear(obs_
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": "class Actor(nn.Module):\n \"\"\"Deterministic actor used by a lower-confidence, self-anchored TD3 variant.\"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action):\n super().__init__()\n self.net = _mlp_factory(obs_dim, action_dim)\n self.register_buffer(\"action_scale\", torch.tensor(max_action, dtype=torch.float32))\n\n def forward(self, obs):\n return torch.tanh(self.net(obs)) * self.action_scale\n\n @torch.no_grad()\n def get_action(self, obs):\n return self.forward(obs)\n\n\nclass QNetwork(nn.Module):\n \"\"\"Q-function Q(s, a) -> scalar.\"\"\"\n\n def __init__(self, obs_dim, action_dim):\n super().__init__()\n self.net = _mlp_factory(obs_dim + action_dim, 1)\n\n def forward(self, obs, action):\n return self.net(torch.cat([obs, action], dim=-1))\n\n\nclass OffPolicyAlgorithm:\n \"\"\"Lower-Confidence Self-Anchored TD3.\n\n Novel ingredients beyond vanilla TD3:\n 1) lower-confidence actor/value optimization via critic disagreement penalties;\n 2) self-anchored advantage-weighted behavior regularization, where the actor only\n imitates replay actions when they score better than the current policy action;\n 3) disagreement-adaptive exploration noise.\n \"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action, device, args):\n self.device = device\n self.max_action = max_action\n self.gamma = args.gamma\n self.tau = args.tau\n self.exploration_noise = args.exploration_noise\n self.policy_frequency = args.policy_frequency\n self.policy_noise = 0.2\n self.noise_clip = 0.5\n self.actor_bc_alpha = 2.5\n self.actor_lcb_scale = 0.25\n self.target_lcb_scale = 0.10\n self.consistency_coef = 0.05\n self.disagreement_ema = 1.0\n self.total_it = 0\n\n self.actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.target_actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.target_actor.load_state_dict(self.actor.state_dict())\n\n self.qf1 = QNetwork(obs_dim, action_dim).to(device)\n self.qf2 = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf2_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target.load_state_dict(self.qf1.state_dict())\n self.qf2_target.load_state_dict(self.qf2.state_dict())\n\n self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=args.learning_rate)\n self.q_optimizer = optim.Adam(\n list(self.qf1.parameters()) + list(self.qf2.parameters()), lr=args.learning_rate\n )\n\n def select_action(self, obs):\n \"\"\"Exploration scales up mildly in states where the critics disagree more.\"\"\"\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n with torch.no_grad():\n action_t = self.actor(obs_t)\n q_gap = (self.qf1(obs_t, action_t) - self.qf2(obs_t, action_t)).abs().item()\n noise_mult = 1.0 + 0.5 * np.tanh(q_gap / (self.disagreement_ema + 1e-6))\n action = action_t.cpu().numpy().flatten()\n noise = np.random.normal(\n 0, self.max_action * self.exploration_noise * noise_mult, size=action.shape\n )\n return np.clip(action + noise, -self.max_action, self.max_action)\n\n def update(self, batch):\n \"\"\"Single gradient update. 
Returns a dict of scalar metrics.\"\"\"\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n\n with torch.no_grad():\n noise = (torch.randn_like(actions) * self.policy_noise).clamp(\n -self.noise_clip, self.noise_clip\n ) * self.max_action\n next_actions = (self.target_actor(next_obs) + noise).clamp(-self.max_action, self.max_action)\n target_q1 = self.qf1_target(next_obs, next_actions).view(-1)\n target_q2 = self.qf2_target(next_obs, next_actions).view(-1)\n target_q = torch.min(target_q1, target_q2) - self.target_lcb_scale * (target_q1 - target_q2).abs()\n td_target = rewards + (1 - dones) * self.gamma * target_q\n\n q1 = self.qf1(obs, actions).view(-1)\n q2 = self.qf2(obs, actions).view(-1)\n critic_loss = (\n F.mse_loss(q1, td_target)\n + F.mse_loss(q2, td_target)\n + self.consistency_coef * (q1 - q2).pow(2).mean()\n )\n\n self.q_optimizer.zero_grad()\n critic_loss.backward()\n self.q_optimizer.step()\n self.disagreement_ema = 0.995 * self.disagreement_ema + 0.005 * (q1 - q2).abs().mean().item()\n\n actor_loss_val, bc_loss_val, q_gap_val = 0.0, 0.0, (q1 - q2).abs().mean().item()\n if self.total_it % self.policy_frequency == 0:\n pi = self.actor(obs)\n q1_pi = self.qf1(obs, pi).view(-1)\n q2_pi = self.qf2(obs, pi).view(-1)\n q_lcb = torch.min(q1_pi, q2_pi) - self.actor_lcb_scale * (q1_pi - q2_pi).abs()\n\n with torch.no_grad():\n q1_data = self.qf1(obs, actions).view(-1)\n q2_data = self.qf2(obs, actions).view(-1)\n q_data_lcb = torch.min(q1_data, q2_data) - self.actor_lcb_scale * (q1_data - q2_data).abs()\n adv = q_data_lcb - q_lcb.detach()\n adv = adv / (adv.abs().mean() + 1e-6)\n weights = torch.exp(torch.clamp(adv, min=0.0, max=4.0)) - 1.0\n\n bc_err = ((pi - actions) / self.max_action).pow(2).mean(dim=1)\n bc_loss = (weights * bc_err).sum() / (weights.sum() + 1e-6)\n q_coef = self.actor_bc_alpha / (q_lcb.abs().mean().detach() + 1e-6)\n actor_loss = bc_loss - q_coef * q_lcb.mean()\n\n self.actor_optimizer.zero_grad()\n actor_loss.backward()\n self.actor_optimizer.step()\n actor_loss_val = actor_loss.item()\n bc_loss_val = bc_loss.item()\n\n soft_update(self.target_actor, self.actor, self.tau)\n soft_update(self.qf1_target, self.qf1, self.tau)\n soft_update(self.qf2_target, self.qf2, self.tau)\n\n return {\n \"critic_loss\": critic_loss.item(),\n \"actor_loss\": actor_loss_val,\n \"bc_loss\": bc_loss_val,\n \"q_gap\": q_gap_val,\n }",
5 "after_line": 244,
6 "start_line": 153,
7 "end_line": 244
8}OK: Replaced lines 153..244 with 146 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 153–298.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 153–298 | total: 379 lines]
150| # =====================================================================
151| # EDITABLE: Network definitions and OffPolicyAlgorithm
152| # =====================================================================
153 class Actor(nn.Module):
154 """Deterministic actor used by a lower-confidence, self-anchored TD3 variant."""
155
156 def __init__(self, obs_dim, action_dim, max_action):
157 super().__init__()
158 self.net = _mlp_factory(obs_dim, action_dim)
159 self.register_buffer("action_scale", torch.tensor(max_action, dtype=torch.float32))
160
161 def forward(self, obs):
162 return torch.tanh(self.net(obs)) * self.action_scale
163
164 @torch.no_grad()
165
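The two pessimism terms described in the class docstring reduce to one expression each. Below is a standalone sketch of ingredients 1 and 2 with stand-in tensors (the batch size and random values are illustrative only); the coefficients match the constants set in the code above.

```python
import torch

batch = 256
q1, q2 = torch.randn(batch), torch.randn(batch)  # stand-in twin-critic outputs
target_lcb_scale = 0.10                          # as in the __init__ above

# (1) Lower-confidence value: pessimism grows with critic disagreement;
#     a scale of 0 recovers the plain clipped double-Q minimum.
q_lcb = torch.min(q1, q2) - target_lcb_scale * (q1 - q2).abs()

# (2) Self-anchored advantage weights: replay actions receive imitation weight
#     only when they score better than the current policy action (adv > 0).
adv = torch.randn(batch)                         # stand-in for q_data_lcb - q_lcb
adv = adv / (adv.abs().mean() + 1e-6)
weights = torch.exp(torch.clamp(adv, min=0.0, max=4.0)) - 1.0
```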
Tool Input
1{
2 "is_final": false,
3 "use_last": false
4}
Tool Input
1{
2 "is_final": false,
3 "use_last": false
4}
## Seed 42
### halfcheetah-v4 (scripts/halfcheetah.sh)
Training metrics (last steps):
TRAIN_METRICS step=627000 critic_loss=32.7578 actor_loss=-657.9374 alpha=0.3014 alpha_loss=0.1669
TRAIN_METRICS step=628000 critic_loss=51.5589 actor_loss=-656.1498 alpha=0.3014 alpha_loss=0.0844
TRAIN_METRICS step=629000 critic_loss=28.9583 actor_loss=-674.4934 alpha=0.2979 alpha_loss=-0.0080
TRAIN_METRICS step=630000 critic_loss=34.5233 actor_loss=-655.2038 alpha=0.2986 alpha_loss=0.0966
TRAIN_METRICS step=631000 critic_loss=21.5168 actor_loss=-665.9606 alpha=0.3002 alpha_loss=-0.0348
Evaluation (halfcheetah-v4):
Eval episodic_return: 10176.58
Eval episodic_return: 10093.58
Eval episodic_return: 10134.89
Final eval return: 10134.89
### reacher-v4 (scripts/reacher.sh)
Training metrics (last steps):
TRAIN_METRICS step=683000 critic_loss=0.0032 actor_loss=10.6085 alpha=0.0347 alpha_loss=0.0033
TRAIN_METRICS step=684000 critic_loss=0.0031 actor_loss=10.6704 alpha=0.0346 alpha_loss=-0.0023
TRAIN_METRI