Agent Conversation: openai/gpt-5.4-pro
Task: rl-offpolicy-continuous
Online RL: Off-Policy Actor-Critic for Continuous Control
Objective
Design and implement an off-policy actor-critic RL algorithm for continuous control. Your code goes in custom_offpolicy_continuous.py. Three reference implementations (DDPG, TD3, SAC) are provided read-only.
Background
Off-policy methods maintain a replay buffer of past experience and update the policy using data collected under previous policies. Key challenges include overestimation bias in Q-value estimates, the exploration-exploitation tradeoff, and sample efficiency. Different approaches address these through twin critics, entropy regularization, target policy smoothing, or delayed policy updates.
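To make the twin-critic remedy concrete, here is a minimal sketch of the clipped double-Q backup used by TD3-style methods. It is illustrative only; the network and tensor names (`qf1_target`, `qf2_target`, `target_actor`, `next_obs`, `rewards`, `dones`) are assumptions chosen to mirror the file below, not part of the provided references.

```python
import torch

def clipped_double_q_target(qf1_target, qf2_target, target_actor,
                            next_obs, rewards, dones, gamma=0.99):
    """Conservative TD target: back up the minimum of two target critics
    to counteract overestimation bias (illustrative sketch)."""
    with torch.no_grad():
        next_actions = target_actor(next_obs)
        q1 = qf1_target(next_obs, next_actions).view(-1)
        q2 = qf2_target(next_obs, next_actions).view(-1)
        return rewards + (1.0 - dones) * gamma * torch.min(q1, q2)
```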
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
Evaluation
Policies are trained and evaluated on HalfCheetah-v4, Hopper-v4, and Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
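For reference, the reported metric amounts to the following computation. This is a hedged sketch with stand-in names (`env`, `policy`); the actual evaluation loop lives in the FIXED portion of the script.

```python
import numpy as np

def mean_episodic_return(env, policy, n_episodes=10):
    """Average undiscounted return over n_episodes rollouts (higher is better)."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))
```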
cleanrl/cleanrl/custom_offpolicy_continuous.py [EDITABLE — lines 153–244 only]
1: # Custom off-policy continuous RL algorithm for MLS-Bench
2: #
3: # EDITABLE section: Actor, QNetwork, and OffPolicyAlgorithm classes.
4: # FIXED sections: everything else (config, env, buffer, eval, training loop).
5: import os
6: import random
7: import time
8: from dataclasses import dataclass
9:
10: import gymnasium as gym
11: import numpy as np
12: import torch
13: import torch.nn as nn
14: import torch.nn.functional as F
15: import torch.optim as optim
16: import tyro
17:
18:
19: # =====================================================================
20: # FIXED: Configuration
21: # =====================================================================
22: @dataclass
23: class Args:
24: exp_name: str = os.path.basename(__file__)[: -len(".py")]
25: """the name of this experiment"""
26: seed: int = 1
27: """seed of the experiment"""
28: torch_deterministic: bool = True
29: """if toggled, `torch.backends.cudnn.deterministic=False`"""
30: cuda: bool = True
31: """if toggled, cuda will be enabled by default"""
32:
33: # Algorithm specific arguments
34: env_id: str = "HalfCheetah-v4"
35: """the id of the environment"""
36: total_timesteps: int = 1000000
37: """total timesteps of the experiments"""
38: learning_rate: float = 3e-4
39: """the learning rate of the optimizer"""
40: buffer_size: int = int(1e6)
41: """the replay memory buffer size"""
42: gamma: floa
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": "class Actor(nn.Module):\n \"\"\"Deterministic tanh actor used for training and evaluation.\"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action):\n super().__init__()\n self.max_action = max_action\n self.fc1 = nn.Linear(obs_dim, 256)\n self.fc2 = nn.Linear(256, 256)\n self.fc_mu = nn.Linear(256, action_dim)\n self.register_buffer(\"action_scale\", torch.tensor(max_action, dtype=torch.float32))\n\n def forward(self, obs):\n x = F.relu(self.fc1(obs))\n x = F.relu(self.fc2(x))\n return torch.tanh(self.fc_mu(x)) * self.action_scale\n\n @torch.no_grad()\n def get_action(self, obs):\n return self.forward(obs)\n\n\nclass QNetwork(nn.Module):\n \"\"\"Q-function Q(s, a) -> scalar.\"\"\"\n\n def __init__(self, obs_dim, action_dim):\n super().__init__()\n self.fc1 = nn.Linear(obs_dim + action_dim, 256)\n self.fc2 = nn.Linear(256, 256)\n self.fc3 = nn.Linear(256, 1)\n\n def forward(self, obs, action):\n x = torch.cat([obs, action], dim=-1)\n x = F.relu(self.fc1(x))\n x = F.relu(self.fc2(x))\n return self.fc3(x)\n\n\nclass OffPolicyAlgorithm:\n \"\"\"UATD3: uncertainty-adaptive TD3 with gated behavior anchoring.\n\n Core ideas:\n 1) Adaptive soft-clipped targets: interpolate between min and max target-Q only\n when the critics agree, reducing TD3's underestimation while staying robust.\n 2) Advantage-gated anchoring: if replay actions look better than the current\n policy at a state, softly pull the actor toward them; this fades over time.\n 3) Neighborhood policy improvement: optimize a blend of Q(pi) and Q(pi+eps)\n to prefer actions that remain good under small perturbations.\n \"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action, device, args):\n self.device = device\n self.max_action = max_action\n self.gamma = args.gamma\n self.tau = args.tau\n self.exploration_noise = args.exploration_noise\n self.policy_frequency = args.policy_frequency\n self.policy_noise = 0.2\n self.noise_clip = 0.5\n self.target_mix = 0.25\n self.actor_alpha = 2.5\n self.anchor_decay = 200000.0\n self.robust_noise = 0.05\n self.total_it = 0\n self.uncertainty_ema = 0.25\n\n self.actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.target_actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.target_actor.load_state_dict(self.actor.state_dict())\n\n self.qf1 = QNetwork(obs_dim, action_dim).to(device)\n self.qf2 = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf2_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target.load_state_dict(self.qf1.state_dict())\n self.qf2_target.load_state_dict(self.qf2.state_dict())\n\n self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=args.learning_rate)\n self.q_optimizer = optim.Adam(\n list(self.qf1.parameters()) + list(self.qf2.parameters()), lr=args.learning_rate\n )\n\n def select_action(self, obs):\n \"\"\"Exploration noise scales mildly with running critic disagreement.\"\"\"\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n with torch.no_grad():\n action = self.actor(obs_t).cpu().numpy().flatten()\n noise_mult = float(np.clip(0.5 + 2.0 * self.uncertainty_ema, 0.5, 1.5))\n noise = np.random.normal(\n 0.0, self.max_action * self.exploration_noise * noise_mult, size=action.shape\n )\n return np.clip(action + noise, -self.max_action, self.max_action)\n\n def update(self, batch):\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n\n with torch.no_grad():\n noise = (torch.randn_like(actions) * 
self.policy_noise).clamp(\n -self.noise_clip, self.noise_clip\n ) * self.max_action\n next_actions = (self.target_actor(next_obs) + noise).clamp(\n -self.max_action, self.max_action\n )\n target_q1 = self.qf1_target(next_obs, next_actions).view(-1)\n target_q2 = self.qf2_target(next_obs, next_actions).view(-1)\n q_min = torch.minimum(target_q1, target_q2)\n q_max = torch.maximum(target_q1, target_q2)\n gap = (target_q1 - target_q2).abs()\n rel_gap = gap / (0.5 * (target_q1.abs() + target_q2.abs()) + 1.0)\n trust = torch.exp(-rel_gap)\n blended_target_q = q_min + self.target_mix * trust * (q_max - q_min)\n td_target = rewards + (1 - dones) * self.gamma * blended_target_q\n\n q1 = self.qf1(obs, actions).view(-1)\n q2 = self.qf2(obs, actions).view(-1)\n critic_loss = F.smooth_l1_loss(q1, td_target) + F.smooth_l1_loss(q2, td_target)\n\n self.q_optimizer.zero_grad()\n critic_loss.backward()\n self.q_optimizer.step()\n\n soft_update(self.qf1_target, self.qf1, self.tau)\n soft_update(self.qf2_target, self.qf2, self.tau)\n self.uncertainty_ema = 0.99 * self.uncertainty_ema + 0.01 * rel_gap.mean().item()\n\n actor_loss_val = 0.0\n anchor_loss_val = 0.0\n if self.total_it % self.policy_frequency == 0:\n pi = self.actor(obs)\n q1_pi = self.qf1(obs, pi).view(-1)\n q2_pi = self.qf2(obs, pi).view(-1)\n q_pi = torch.minimum(q1_pi, q2_pi)\n\n smooth_noise = (torch.randn_like(pi) * self.robust_noise).clamp(\n -2 * self.robust_noise, 2 * self.robust_noise\n ) * self.max_action\n pi_perturbed = (pi + smooth_noise).clamp(-self.max_action, self.max_action)\n q_pi_perturbed = torch.minimum(\n self.qf1(obs, pi_perturbed), self.qf2(obs, pi_perturbed)\n ).view(-1)\n robust_q = 0.75 * q_pi + 0.25 * q_pi_perturbed\n\n with torch.no_grad():\n q_data = torch.minimum(self.qf1(obs, actions), self.qf2(obs, actions)).view(-1)\n adv_gap = (q_data - q_pi.detach()) / (\n 0.5 * (q_data.abs() + q_pi.detach().abs()) + 1.0\n )\n pi_rel_gap = (q1_pi.detach() - q2_pi.detach()).abs() / (\n 0.5 * (q1_pi.detach().abs() + q2_pi.detach().abs()) + 1.0\n )\n anchor_weight = torch.sigmoid(4.0 * adv_gap) * (1.0 + pi_rel_gap)\n anchor_scale = 1.0 / (1.0 + self.total_it / self.anchor_decay)\n\n anchor_loss = (\n anchor_weight * ((pi - actions) / self.max_action).pow(2).mean(dim=-1)\n ).mean()\n q_lambda = self.actor_alpha / (q_pi.detach().abs().mean() + 1e-6)\n actor_loss = -q_lambda * robust_q.mean() + anchor_scale * anchor_loss\n\n self.actor_optimizer.zero_grad()\n actor_loss.backward()\n self.actor_optimizer.step()\n soft_update(self.target_actor, self.actor, self.tau)\n\n actor_loss_val = actor_loss.item()\n anchor_loss_val = anchor_loss.item()\n\n return {\n \"critic_loss\": critic_loss.item(),\n \"actor_loss\": actor_loss_val,\n \"anchor_loss\": anchor_loss_val,\n \"uncertainty\": float(self.uncertainty_ema),\n }\n",
5 "after_line": 244,
6 "start_line": 153,
7 "end_line": 244
8}
OK: Replaced lines 153..244 with 173 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 153–325.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 153–325 | total: 406 lines]
150| # =====================================================================
151| # EDITABLE: Network definitions and OffPolicyAlgorithm
152| # =====================================================================
153 class Actor(nn.Module):
154 """Deterministic tanh actor used for training and evaluation."""
155
156 def __init__(self, obs_dim, action_dim, max_action):
157 super().__init__()
158 self.max_action = max_action
159 self.fc1 = nn.Linear(obs_dim, 256)
160 self.fc2 = nn.Linear(256, 256)
161 self.fc_mu = nn.Linear(256, action_dim)
162 self.register_buffer("action_scale", torch.tensor(max_action, dtype=torch.float32))
163
164 def forward(sel
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

## Seed 42

### halfcheetah-v4 (scripts/halfcheetah.sh)
Training metrics (last steps):
TRAIN_METRICS step=896000 critic_loss=3.3085 actor_loss=-2.4938 anchor_loss=0.0231 uncertainty=0.0082
TRAIN_METRICS step=897000 critic_loss=3.9425 actor_loss=-2.4888 anchor_loss=0.0216 uncertainty=0.0083
TRAIN_METRICS step=898000 critic_loss=4.0854 actor_loss=-2.4902 anchor_loss=0.0206 uncertainty=0.0084
TRAIN_METRICS step=899000 critic_loss=4.0657 actor_loss=-2.4910 anchor_loss=0.0169 uncertainty=0.0081
TRAIN_METRICS step=900000 critic_loss=3.8973 actor_loss=-2.4900 anchor_loss=0.0226 uncertainty=0.0086

Evaluation (halfcheetah-v4):
Eval episodic_return: 4996.11
Eval episodic_return: 5909.64
Eval episodic_return: 4900.97
Final eval return: 4900.97

### reacher-v4 (scripts/reacher.sh)
Training metrics (last steps):
TRAIN_METRICS step=995000 critic_loss=0.0093 actor_loss=2.5010 anchor_loss=0.0036 uncertainty=0.0020
TRAIN_METRICS st
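For readability (the tool input above stores the code as an escaped JSON string), the first attempt's adaptive soft-clipped target can be restated as a standalone sketch; the default `mix=0.25` mirrors `target_mix` in that attempt.

```python
import torch

def trust_weighted_target(q1, q2, mix=0.25):
    """UATD3-style target blend: move from min(Q1, Q2) toward max(Q1, Q2)
    only when the twin critics agree (small relative gap -> trust near 1)."""
    q_min = torch.minimum(q1, q2)
    q_max = torch.maximum(q1, q2)
    rel_gap = (q1 - q2).abs() / (0.5 * (q1.abs() + q2.abs()) + 1.0)
    trust = torch.exp(-rel_gap)
    return q_min + mix * trust * (q_max - q_min)
```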
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": "class OffPolicyAlgorithm:\n \"\"\"AW-TD3: Agreement-Weighted TD3.\n\n Novel contribution: instead of always backing up with hard min(Q1, Q2), use an\n agreement-weighted soft clip. When twin critics strongly agree, we move the\n target and actor objective slightly above the TD3 minimum to reduce systematic\n underestimation; when they disagree, we recover the conservative TD3 backup.\n \"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action, device, args):\n self.device = device\n self.max_action = max_action\n self.gamma = args.gamma\n self.tau = args.tau\n self.exploration_noise = args.exploration_noise\n self.policy_frequency = args.policy_frequency\n self.policy_noise = 0.2\n self.noise_clip = 0.5\n self.target_mix = 0.4\n self.actor_mix = 0.3\n self.agreement_temp = 2.0\n self.total_it = 0\n\n self.actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.target_actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.target_actor.load_state_dict(self.actor.state_dict())\n\n self.qf1 = QNetwork(obs_dim, action_dim).to(device)\n self.qf2 = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf2_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target.load_state_dict(self.qf1.state_dict())\n self.qf2_target.load_state_dict(self.qf2.state_dict())\n\n self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=args.learning_rate)\n self.q_optimizer = optim.Adam(\n list(self.qf1.parameters()) + list(self.qf2.parameters()), lr=args.learning_rate\n )\n\n def _agreement_blend(self, q1, q2, mix):\n q_min = torch.minimum(q1, q2)\n q_max = torch.maximum(q1, q2)\n rel_gap = (q1 - q2).abs() / (0.5 * (q1.abs() + q2.abs()) + 1.0)\n agreement = torch.exp(-self.agreement_temp * rel_gap)\n soft_q = q_min + mix * agreement * (q_max - q_min)\n return soft_q, agreement\n\n def select_action(self, obs):\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n with torch.no_grad():\n action = self.actor(obs_t).cpu().numpy().flatten()\n noise = np.random.normal(0, self.max_action * self.exploration_noise, size=action.shape)\n return np.clip(action + noise, -self.max_action, self.max_action)\n\n def update(self, batch):\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n\n with torch.no_grad():\n noise = (torch.randn_like(actions) * self.policy_noise).clamp(\n -self.noise_clip, self.noise_clip\n ) * self.max_action\n next_actions = (self.target_actor(next_obs) + noise).clamp(\n -self.max_action, self.max_action\n )\n target_q1 = self.qf1_target(next_obs, next_actions).view(-1)\n target_q2 = self.qf2_target(next_obs, next_actions).view(-1)\n target_q, target_agreement = self._agreement_blend(\n target_q1, target_q2, self.target_mix\n )\n td_target = rewards + (1 - dones) * self.gamma * target_q\n\n q1 = self.qf1(obs, actions).view(-1)\n q2 = self.qf2(obs, actions).view(-1)\n critic_loss = F.mse_loss(q1, td_target) + F.mse_loss(q2, td_target)\n\n self.q_optimizer.zero_grad()\n critic_loss.backward()\n self.q_optimizer.step()\n\n actor_loss_val = 0.0\n actor_agreement_val = 0.0\n if self.total_it % self.policy_frequency == 0:\n pi = self.actor(obs)\n q1_pi = self.qf1(obs, pi).view(-1)\n q2_pi = self.qf2(obs, pi).view(-1)\n actor_q, actor_agreement = self._agreement_blend(q1_pi, q2_pi, self.actor_mix)\n actor_loss = -actor_q.mean()\n\n self.actor_optimizer.zero_grad()\n actor_loss.backward()\n self.actor_optimizer.step()\n\n soft_update(self.qf1_target, 
self.qf1, self.tau)\n soft_update(self.qf2_target, self.qf2, self.tau)\n soft_update(self.target_actor, self.actor, self.tau)\n\n actor_loss_val = actor_loss.item()\n actor_agreement_val = actor_agreement.mean().item()\n\n return {\n \"critic_loss\": critic_loss.item(),\n \"actor_loss\": actor_loss_val,\n \"target_agreement\": target_agreement.mean().item(),\n \"actor_agreement\": actor_agreement_val,\n }\n",
5 "after_line": 325,
6 "start_line": 190,
7 "end_line": 325
8}
OK: Replaced lines 190..325 with 106 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 153–295.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 153–295 | total: 376 lines]
150| # =====================================================================
151| # EDITABLE: Network definitions and OffPolicyAlgorithm
152| # =====================================================================
153 class Actor(nn.Module):
154 """Deterministic tanh actor used for training and evaluation."""
155
156 def __init__(self, obs_dim, action_dim, max_action):
157 super().__init__()
158 self.max_action = max_action
159 self.fc1 = nn.Linear(obs_dim, 256)
160 self.fc2 = nn.Linear(256, 256)
161 self.fc_mu = nn.Linear(256, action_dim)
162 self.register_buffer("action_scale", torch.tensor(max_action, dtype=torch.float32))
163
164 def forward(sel
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### halfcheetah-v4 (scripts/halfcheetah.sh)
Training metrics (last steps):
TRAIN_METRICS step=85000 critic_loss=11.9597 actor_loss=-146.3225 target_agreement=0.9478 actor_agreement=0.9483
TRAIN_METRICS step=86000 critic_loss=12.4670 actor_loss=-152.5188 target_agreement=0.9568 actor_agreement=0.9566
TRAIN_METRICS step=87000 critic_loss=13.4277 actor_loss=-154.7044 target_agreement=0.9572 actor_agreement=0.9539
TRAIN_METRICS step=88000 critic_loss=23.2070 actor_loss=-158.5827 target_agreement=0.9558 actor_agreement=0.9575
TRAIN_METRICS step=89000 critic_loss=13.0975 actor_loss=-164.4578 target_agreement=0.9624 actor_agreement=0.9644

Evaluation (halfcheetah-v4):
Eval episodic_return: 2475.02
Eval episodic_return: 3220.57
Eval episodic_return: 1869.43
Final eval return: 1869.43

### reacher-v4 (scripts/reacher.sh)
Training metrics (last steps):
TRAIN_METRICS step=95000 critic_loss=0.0318 actor_loss=10.8688 target_agreem
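Restated from the escaped tool input above, the second attempt's agreement-weighted blend adds a temperature and is applied to both the critic target (`mix=0.4`) and the actor objective (`mix=0.3`); a minimal sketch of that helper, extracted for readability:

```python
import torch

def agreement_blend(q1, q2, mix, temp=2.0):
    """AW-TD3 blend: interpolate above min(Q1, Q2) in proportion to how
    strongly the twin critics agree; recovers the TD3 min as agreement -> 0."""
    q_min = torch.minimum(q1, q2)
    q_max = torch.maximum(q1, q2)
    rel_gap = (q1 - q2).abs() / (0.5 * (q1.abs() + q2.abs()) + 1.0)
    agreement = torch.exp(-temp * rel_gap)
    return q_min + mix * agreement * (q_max - q_min), agreement
```

At the logged target_agreement of roughly 0.95, the blended target sits about 0.4 * 0.95 ≈ 0.38 of the critic gap above the TD3 minimum.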
Tool Input
1{
2 "op": "replace",
3 "filename": "cleanrl/cleanrl/custom_offpolicy_continuous.py",
4 "content": "class OffPolicyAlgorithm:\n \"\"\"Soft-Clipped Policy TD3.\n\n Novel contribution: keep the standard TD3 critic target, but replace the actor\n update with a soft-clipped double-Q objective. The policy maximizes a convex\n combination of min(Q1, Q2) and max(Q1, Q2), which is equivalent to maximizing\n the average value while applying an implicit disagreement penalty. This keeps\n the TD3 safety benefits while reducing sensitivity to a single critic's errors.\n \"\"\"\n\n def __init__(self, obs_dim, action_dim, max_action, device, args):\n self.device = device\n self.max_action = max_action\n self.gamma = args.gamma\n self.tau = args.tau\n self.exploration_noise = args.exploration_noise\n self.policy_frequency = args.policy_frequency\n self.policy_noise = 0.2\n self.noise_clip = 0.5\n self.actor_mix = 0.25\n self.total_it = 0\n\n self.actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.target_actor = Actor(obs_dim, action_dim, max_action).to(device)\n self.target_actor.load_state_dict(self.actor.state_dict())\n\n self.qf1 = QNetwork(obs_dim, action_dim).to(device)\n self.qf2 = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf2_target = QNetwork(obs_dim, action_dim).to(device)\n self.qf1_target.load_state_dict(self.qf1.state_dict())\n self.qf2_target.load_state_dict(self.qf2.state_dict())\n\n self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=args.learning_rate)\n self.q_optimizer = optim.Adam(\n list(self.qf1.parameters()) + list(self.qf2.parameters()),\n lr=args.learning_rate,\n )\n\n def _soft_clipped(self, q1, q2, mix):\n q_min = torch.minimum(q1, q2)\n q_max = torch.maximum(q1, q2)\n return q_min + mix * (q_max - q_min)\n\n def select_action(self, obs):\n obs_t = torch.tensor(obs.reshape(1, -1), device=self.device, dtype=torch.float32)\n with torch.no_grad():\n action = self.actor(obs_t).cpu().numpy().flatten()\n noise = np.random.normal(0, self.max_action * self.exploration_noise, size=action.shape)\n return np.clip(action + noise, -self.max_action, self.max_action)\n\n def update(self, batch):\n self.total_it += 1\n obs, next_obs, actions, rewards, dones = batch\n\n with torch.no_grad():\n noise = (torch.randn_like(actions) * self.policy_noise).clamp(\n -self.noise_clip, self.noise_clip\n ) * self.max_action\n next_actions = (self.target_actor(next_obs) + noise).clamp(\n -self.max_action, self.max_action\n )\n target_q1 = self.qf1_target(next_obs, next_actions).view(-1)\n target_q2 = self.qf2_target(next_obs, next_actions).view(-1)\n td_target = rewards + (1 - dones) * self.gamma * torch.min(target_q1, target_q2)\n\n q1 = self.qf1(obs, actions).view(-1)\n q2 = self.qf2(obs, actions).view(-1)\n critic_loss = F.mse_loss(q1, td_target) + F.mse_loss(q2, td_target)\n\n self.q_optimizer.zero_grad()\n critic_loss.backward()\n self.q_optimizer.step()\n\n actor_loss_val = 0.0\n disagreement_val = 0.0\n if self.total_it % self.policy_frequency == 0:\n pi = self.actor(obs)\n q1_pi = self.qf1(obs, pi).view(-1)\n q2_pi = self.qf2(obs, pi).view(-1)\n actor_q = self._soft_clipped(q1_pi, q2_pi, self.actor_mix)\n actor_loss = -actor_q.mean()\n\n self.actor_optimizer.zero_grad()\n actor_loss.backward()\n self.actor_optimizer.step()\n\n soft_update(self.qf1_target, self.qf1, self.tau)\n soft_update(self.qf2_target, self.qf2, self.tau)\n soft_update(self.target_actor, self.actor, self.tau)\n\n actor_loss_val = actor_loss.item()\n disagreement_val = (q1_pi - q2_pi).abs().mean().item()\n\n return {\n 
\"critic_loss\": critic_loss.item(),\n \"actor_loss\": actor_loss_val,\n \"critic_disagreement\": disagreement_val,\n }\n",
5 "after_line": 295,
6 "start_line": 190,
7 "end_line": 295
8}
OK: Replaced lines 190..295 with 99 line(s) in cleanrl/cleanrl/custom_offpolicy_continuous.py. Editable range: 153–288.
[Current file: cleanrl/cleanrl/custom_offpolicy_continuous.py | editable: 153–288 | total: 369 lines]
150| # =====================================================================
151| # EDITABLE: Network definitions and OffPolicyAlgorithm
152| # =====================================================================
153 class Actor(nn.Module):
154 """Deterministic tanh actor used for training and evaluation."""
155
156 def __init__(self, obs_dim, action_dim, max_action):
157 super().__init__()
158 self.max_action = max_action
159 self.fc1 = nn.Linear(obs_dim, 256)
160 self.fc2 = nn.Linear(256, 256)
161 self.fc_mu = nn.Linear(256, action_dim)
162 self.register_buffer("action_scale", torch.tensor(max_action, dtype=torch.float32))
163
164 def forward(self
[Test #3] (0 tests remaining)

## Seed 42

### halfcheetah-v4 (scripts/halfcheetah.sh)
Training metrics (last steps):
TRAIN_METRICS step=766000 critic_loss=26.8812 actor_loss=-739.4603 critic_disagreement=1.6009
TRAIN_METRICS step=767000 critic_loss=29.5062 actor_loss=-742.9932 critic_disagreement=1.7002
TRAIN_METRICS step=768000 critic_loss=28.6160 actor_loss=-739.6460 critic_disagreement=1.5442
TRAIN_METRICS step=769000 critic_loss=34.4093 actor_loss=-734.1824 critic_disagreement=1.5390
TRAIN_METRICS step=770000 critic_loss=27.3246 actor_loss=-732.7402 critic_disagreement=1.6616

Evaluation (halfcheetah-v4):
Eval episodic_return: 10255.59
Eval episodic_return: 10060.25
Eval episodic_return: 10150.78
Final eval return: 10150.78

### reacher-v4 (scripts/reacher.sh)
Training metrics (last steps):
TRAIN_METRICS step=875000 critic_loss=0.0231 actor_loss=12.5891 critic_disagreement=0.0201
TRAIN_METRICS step=876000 critic_loss=0.0183 actor_loss=12.5299 critic_disagreement=0.0213
TRAIN_METRI
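The "implicit disagreement penalty" claim in the final attempt's docstring can be checked directly: min + mix·(max − min) equals the critic mean minus (0.5 − mix)·|Q1 − Q2|, so with `actor_mix = 0.25` the actor maximizes the mean Q minus a quarter of the critic gap. A small numerical check (illustrative only):

```python
import torch

q1, q2 = torch.randn(1024), torch.randn(1024)
mix = 0.25  # actor_mix from the final attempt

# Soft-clipped objective as implemented: min + mix * (max - min).
soft = torch.minimum(q1, q2) + mix * (torch.maximum(q1, q2) - torch.minimum(q1, q2))
# Equivalent form: mean Q minus a disagreement penalty scaled by (0.5 - mix).
penalized_mean = 0.5 * (q1 + q2) - (0.5 - mix) * (q1 - q2).abs()
assert torch.allclose(soft, penalized_mean, atol=1e-6)
```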