Agent Conversation: anthropic/claude-opus-4.6
Task: rl-onpolicy-continuous
Online RL: On-Policy Actor-Critic for Continuous Control
Objective
Design and implement an on-policy actor-critic RL algorithm for continuous control. Your code goes in custom_onpolicy_continuous.py. Three reference implementations (PPO, RPO, PPO-Penalty) are provided as read-only.
Background
On-policy methods collect trajectories using the current policy, compute advantages via Generalized Advantage Estimation (GAE), and update the policy using mini-batch optimization. Key challenges include sample efficiency, stability of policy updates, and balancing exploration with exploitation. Different approaches address these through clipped surrogate objectives, stochasticity injection, or direct policy gradient estimation.
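The GAE recursion mentioned above can be sketched in a few lines of plain Python. The buffer layout is an assumption for illustration (per-step lists, with `dones[t]` marking that the episode ended after receiving `rewards[t]`), not the benchmark's actual storage:

```python
def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (pure-Python sketch).

    Convention assumed here: dones[t] is True when the episode ended after
    receiving rewards[t], which blocks bootstrapping across the boundary.
    """
    advantages = [0.0] * len(rewards)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    # Value-function targets: A_t + V(s_t)
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `gamma = lam = 1` this reduces to Monte-Carlo advantages (reward-to-go plus bootstrap minus value), which is a quick sanity check on any implementation.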
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
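Since the parameter budget is enforced at runtime, it can help to pre-compute the count by hand. A back-of-envelope sketch for a fully connected MLP; the layer sizes here are hypothetical (HalfCheetah-like 17-dim observations, 6-dim actions), not the benchmark's fixed architecture:

```python
def mlp_param_count(layer_sizes):
    # Each Linear layer contributes fan_in * fan_out weights plus fan_out biases.
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

print(mlp_param_count([17, 64, 64, 6]))  # hypothetical 17->64->64->6 trunk: 5702
```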
Evaluation
Trained and evaluated on HalfCheetah-v4, Hopper-v4, Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
cleanrl/cleanrl/custom_onpolicy_continuous.py [EDITABLE — lines 175–221 only]
1: # Custom on-policy continuous RL algorithm for MLS-Bench
2: #
3: # FIXED sections: config, env, utilities, network architecture, training loop.
4: # EDITABLE section: get_action_and_value method and compute_losses function.
5: import copy
6: import os
7: import random
8: import time
9: from dataclasses import dataclass
10:
11: import gymnasium as gym
12: import numpy as np
13: import torch
14: import torch.nn as nn
15: import torch.nn.functional as F
16: import torch.optim as optim
17: import tyro
18: from torch.distributions.normal import Normal
19:
20:
21: # =====================================================================
22: # FIXED: Configuration
23: # =====================================================================
24: @dataclass
25: class Args:
26:     exp_name: str = os.path.basename(__file__)[: -len(".py")]
27:     """the name of this experiment"""
28:     seed: int = 1
29:     """seed of the experiment"""
30:     torch_deterministic: bool = True
31:     """if toggled, `torch.backends.cudnn.deterministic=False`"""
32:     cuda: bool = True
33:     """if toggled, cuda will be enabled by default"""
34:
35:     # Algorithm specific arguments
36:     env_id: str = "HalfCheetah-v4"
37:     """the id of the environment"""
38:     total_timesteps: int = 1000000
39:     """total timesteps of the experiments"""
40:     learning_rate: float = 3e-4
41:     """the learning rate of the optimizer"""

[... listing truncated; the file continues to line 412 ...]

Tool Input
{
  "op": "replace",
  "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
  "start_line": 175,
  "end_line": 221,
  "content": "    def get_action_and_value(self, obs, action=None):\n        action_mean = self.actor_mean(obs)\n        action_logstd = self.actor_logstd.expand_as(action_mean)\n        action_std = torch.exp(action_logstd)\n\n        # Random Perturbation Optimization (RPO): inject uniform noise into\n        # the action mean before constructing the Gaussian. Each forward pass\n        # sees a different perturbation, which randomizes the importance-\n        # sampling ratio and implicitly enlarges the exploration region.\n        _rpo_alpha = 0.5\n        z = torch.FloatTensor(action_mean.shape).uniform_(\n            -_rpo_alpha, _rpo_alpha\n        ).to(action_mean.device)\n        action_mean = action_mean + z\n\n        probs = Normal(action_mean, action_std)\n        if action is None:\n            action = probs.sample()\n        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n    \"\"\"Stochastic Trust-Region PPO (STP-PPO).\n\n    Combines RPO exploration (via perturbed action means) with:\n      1. PPO clipped surrogate objective\n      2. Clipped value-function loss\n      3. Advantage-calibrated entropy: entropy bonus is modulated per sample\n         so that states with negative advantages (under-performing) receive a\n         larger entropy push (explore more) while states with positive\n         advantages (well-performing) get a smaller push (exploit more).\n    \"\"\"\n    # Hardcoded hyperparameters for our novel components\n    _ent_coef = 0.005  # base entropy coefficient (overrides default 0.0)\n    _adv_ent_modulation = 0.4  # how much advantages modulate entropy weight\n\n    _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n    logratio = newlogprob - mb_logprobs\n    ratio = logratio.exp()\n\n    with torch.no_grad():\n        approx_kl = ((ratio - 1) - logratio).mean()\n        clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n\n    # ---- Policy loss: PPO clipped surrogate ----\n    pg_loss1 = -mb_advantages * ratio\n    pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)\n    pg_loss = torch.max(pg_loss1, pg_loss2).mean()\n\n    # ---- Value loss: clipped ----\n    newvalue = newvalue.view(-1)\n    if args.clip_vloss:\n        v_loss_unclipped = (newvalue - mb_returns) ** 2\n        v_clipped = mb_values + torch.clamp(\n            newvalue - mb_values, -args.clip_coef, args.clip_coef\n        )\n        v_loss_clipped = (v_clipped - mb_returns) ** 2\n        v_loss = 0.5 * torch.max(v_loss_unclipped, v_loss_clipped).mean()\n    else:\n        v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()\n\n    # ---- Advantage-Calibrated Entropy (ACE) ----\n    # States with negative normalised advantages -> weight > 1 (explore more)\n    # States with positive normalised advantages -> weight < 1 (exploit more)\n    with torch.no_grad():\n        ace_weight = 1.0 - _adv_ent_modulation * torch.tanh(mb_advantages)\n    entropy_loss = (entropy * ace_weight).mean()\n\n    loss = pg_loss - _ent_coef * entropy_loss + v_loss * args.vf_coef\n\n    return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac\n\n\n"
}

OK: Replaced lines 175..221 with 73 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–247.
[Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–247 | total: 412 lines]

172| # =================================================================
173| # EDITABLE: get_action_and_value and compute_losses
174| # =================================================================
175|     def get_action_and_value(self, obs, action=None):
176|         action_mean = self.actor_mean(obs)
177|         action_logstd = self.actor_logstd.expand_as(action_mean)
178|         action_std = torch.exp(action_logstd)
179|
180|         # Random Perturbation Optimization (RPO): inject uniform noise into
181|         # the action mean before constructing the Gaussian. Each forward pass
182|         # sees a different perturbation, which randomizes the importance-
183|         # sampling ratio and implicitly enlarges the exploration region.
[... snippet truncated ...]
[Test #1] (2 tests remaining; call test(submit=1) to submit this result)

## Seed 42

### halfcheetah-v4 (scripts/halfcheetah.sh)

Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=0.0032 vf_loss=0.0544 entropy=16.8484 approx_kl=0.0190 clipfrac=0.3201
TRAIN_METRICS step=993280 pg_loss=0.0025 vf_loss=0.0088 entropy=17.3134 approx_kl=0.0200 clipfrac=0.3445
TRAIN_METRICS step=995328 pg_loss=0.0406 vf_loss=0.0066 entropy=17.4849 approx_kl=0.0278 clipfrac=0.3287
TRAIN_METRICS step=997376 pg_loss=-0.0369 vf_loss=0.0069 entropy=17.4268 approx_kl=0.0306 clipfrac=0.3330
TRAIN_METRICS step=999424 pg_loss=0.0156 vf_loss=0.0084 entropy=17.5236 approx_kl=0.0309 clipfrac=0.3234

Evaluation (halfcheetah-v4):
Eval episodic_return: 871.24
Eval episodic_return: 897.59
Eval episodic_return: 885.73
Final eval return: 885.73

### swimmer-v4 (scripts/swimmer.sh)

Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=-0.0018 vf_loss=0.0093 entropy=5.3909 approx_kl=0.0059 clipfrac=0.12[... output truncated ...]
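Attempt #1's entropy modulation is easy to study in isolation. A pure-Python sketch of the ACE weight, mirroring the hardcoded `_adv_ent_modulation = 0.4` (the in-file version applies this to whole torch mini-batches under `no_grad`):

```python
import math

def ace_weights(advantages, modulation=0.4):
    # tanh bounds the modulation, so weights stay in [1 - m, 1 + m]:
    # negative advantages -> weight > 1 (more entropy bonus, explore),
    # positive advantages -> weight < 1 (less entropy bonus, exploit).
    return [1.0 - modulation * math.tanh(a) for a in advantages]
```

Because the weights are computed under `no_grad`, they scale the entropy bonus per sample without contributing gradients of their own.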
Tool Input
{
  "op": "replace",
  "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
  "start_line": 175,
  "end_line": 247,
  "content": "    def get_action_and_value(self, obs, action=None):\n        action_mean = self.actor_mean(obs)\n        action_logstd = self.actor_logstd.expand_as(action_mean)\n        action_std = torch.exp(action_logstd)\n\n        # Scale-Adaptive Random Perturbation (SARP):\n        # Unlike fixed-scale RPO, perturbation scales with the current policy\n        # std so it shrinks as the policy becomes more certain, preventing\n        # over-perturbation in precise-control tasks while preserving\n        # exploration benefits during early training.\n        _sarp_alpha = 0.15\n        z = torch.FloatTensor(action_mean.shape).uniform_(-1.0, 1.0).to(action_mean.device)\n        action_mean = action_mean + _sarp_alpha * action_std.detach() * z\n\n        probs = Normal(action_mean, action_std)\n        if action is None:\n            action = probs.sample()\n        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n    \"\"\"Advantage-Weighted PPO with Scale-Adaptive Perturbation (AW-SARP-PPO).\n\n    Novel combination:\n      1. SARP exploration in get_action_and_value (std-proportional perturbation)\n      2. Exponential advantage weighting: samples with higher advantages get\n         larger gradient weight, focusing policy updates on clearly beneficial\n         actions — bridging PPO and AWR philosophies.\n      3. PPO clipped surrogate ensures trust region stability.\n      4. Clipped value function loss.\n    \"\"\"\n    _awr_beta = 2.0  # advantage weighting temperature\n    _awr_max_weight = 10.0  # stability clamp\n\n    _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n    logratio = newlogprob - mb_logprobs\n    ratio = logratio.exp()\n\n    with torch.no_grad():\n        approx_kl = ((ratio - 1) - logratio).mean()\n        clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n\n    # ---- Advantage-based importance weights (AWR-style) ----\n    # Exponential weighting focuses gradient on high-advantage samples.\n    # Since weights > 0, the max(pg1, pg2) factorises cleanly:\n    #   max(w*f, w*g) = w * max(f, g)  => correct PPO clipping behaviour preserved.\n    with torch.no_grad():\n        weights = torch.exp(mb_advantages / _awr_beta)\n        weights = torch.clamp(weights, max=_awr_max_weight)\n        weights = weights / (weights.mean() + 1e-8)  # normalise to mean 1\n\n    # ---- Policy loss: weighted PPO clipped surrogate ----\n    pg_loss1 = -mb_advantages * ratio * weights\n    pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef) * weights\n    pg_loss = torch.max(pg_loss1, pg_loss2).mean()\n\n    # ---- Value loss: clipped ----\n    newvalue = newvalue.view(-1)\n    if args.clip_vloss:\n        v_loss_unclipped = (newvalue - mb_returns) ** 2\n        v_clipped = mb_values + torch.clamp(\n            newvalue - mb_values, -args.clip_coef, args.clip_coef\n        )\n        v_loss_clipped = (v_clipped - mb_returns) ** 2\n        v_loss = 0.5 * torch.max(v_loss_unclipped, v_loss_clipped).mean()\n    else:\n        v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()\n\n    entropy_loss = entropy.mean()\n    loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef\n\n    return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac\n\n\n"
}

OK: Replaced lines 175..247 with 74 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–248.
[Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–248 | total: 413 lines]

172| # =================================================================
173| # EDITABLE: get_action_and_value and compute_losses
174| # =================================================================
175|     def get_action_and_value(self, obs, action=None):
176|         action_mean = self.actor_mean(obs)
177|         action_logstd = self.actor_logstd.expand_as(action_mean)
178|         action_std = torch.exp(action_logstd)
179|
180|         # Scale-Adaptive Random Perturbation (SARP):
181|         # Unlike fixed-scale RPO, perturbation scales with the current policy
182|         # std so it shrinks as the policy becomes more certain, preventing
183|         # over-perturbation in precise-control tasks while preserving
[... snippet truncated ...]
[Test #2] (1 test remaining; call test(submit=2) to submit this result)

### halfcheetah-v4 (scripts/halfcheetah.sh)

Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=-0.1989 vf_loss=0.0720 entropy=-6.3696 approx_kl=0.0489 clipfrac=0.4721
TRAIN_METRICS step=993280 pg_loss=-0.2516 vf_loss=0.0111 entropy=-6.3698 approx_kl=0.0409 clipfrac=0.4806
TRAIN_METRICS step=995328 pg_loss=-0.2549 vf_loss=0.0057 entropy=-6.3701 approx_kl=0.0581 clipfrac=0.4722
TRAIN_METRICS step=997376 pg_loss=-0.2049 vf_loss=0.0043 entropy=-6.3706 approx_kl=0.0555 clipfrac=0.4667
TRAIN_METRICS step=999424 pg_loss=-0.1728 vf_loss=0.0061 entropy=-6.3708 approx_kl=0.0543 clipfrac=0.4606

Evaluation (halfcheetah-v4):
Eval episodic_return: 1699.68
Eval episodic_return: 1712.04
Eval episodic_return: 1708.78
Final eval return: 1708.78

### swimmer-v4 (scripts/swimmer.sh)

Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=-0.5514 vf_loss=0.0000 entropy=-7.2521 approx_kl=0.0168 clipfrac=0.2220
[... output truncated ...]
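The exponential advantage weighting from attempt #2 can likewise be isolated as pure Python, with the same `_awr_beta` / `_awr_max_weight` defaults (the in-file version does this on torch tensors under `no_grad`):

```python
import math

def awr_weights(advantages, beta=2.0, max_weight=10.0):
    # exp(A / beta) up-weights high-advantage samples; the clamp bounds
    # outliers, and mean-normalisation preserves the overall gradient scale.
    w = [min(math.exp(a / beta), max_weight) for a in advantages]
    mean = sum(w) / len(w)
    return [x / (mean + 1e-8) for x in w]
```

Because every weight is positive, multiplying both surrogate branches by it commutes with the `max`, so PPO's clipping behaviour is preserved exactly as the docstring in the tool call claims.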
Tool Input
{
  "op": "replace",
  "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
  "start_line": 175,
  "end_line": 248,
  "content": "    def get_action_and_value(self, obs, action=None):\n        action_mean = self.actor_mean(obs)\n        action_logstd = self.actor_logstd.expand_as(action_mean)\n        action_std = torch.exp(action_logstd)\n        probs = Normal(action_mean, action_std)\n        if action is None:\n            action = probs.sample()\n        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n    \"\"\"Asymmetric-Clip PPO with Unconstrained Value Learning (ACU-PPO).\n\n    Novel contributions over standard PPO:\n      1. Asymmetric policy clipping: positive advantages use a wider upper clip\n         (1 + 2*epsilon) allowing more aggressive exploitation of discovered\n         good actions, while negative advantages retain standard symmetric clip\n         for conservative avoidance of harmful actions. This addresses PPO's\n         over-conservative treatment of high-advantage transitions without\n         sacrificing trust-region stability for negative-advantage corrections.\n      2. Unconstrained value loss: removes the value-function clipping used in\n         standard PPO. Value clipping can impede value accuracy by preventing\n         necessary corrections; removing it allows faster convergence to\n         accurate value estimates that improve advantage signal quality.\n    \"\"\"\n    _pos_clip_mult = 2.0  # upper clip multiplier for positive advantages\n\n    _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n    logratio = newlogprob - mb_logprobs\n    ratio = logratio.exp()\n\n    with torch.no_grad():\n        approx_kl = ((ratio - 1) - logratio).mean()\n        clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n\n    # ---- Asymmetric clipping ----\n    clip_lo = 1.0 - args.clip_coef\n    clip_hi_pos = 1.0 + args.clip_coef * _pos_clip_mult  # wider for A >= 0\n    clip_hi_neg = 1.0 + args.clip_coef  # standard for A < 0\n\n    ratio_clip_pos = torch.clamp(ratio, clip_lo, clip_hi_pos)\n    ratio_clip_neg = torch.clamp(ratio, clip_lo, clip_hi_neg)\n    ratio_clipped = torch.where(mb_advantages >= 0, ratio_clip_pos, ratio_clip_neg)\n\n    pg_loss1 = -mb_advantages * ratio\n    pg_loss2 = -mb_advantages * ratio_clipped\n    pg_loss = torch.max(pg_loss1, pg_loss2).mean()\n\n    # ---- Unconstrained value loss (plain MSE) ----\n    newvalue = newvalue.view(-1)\n    v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()\n\n    entropy_loss = entropy.mean()\n    loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef\n\n    return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac\n\n\n"
}

OK: Replaced lines 175..248 with 58 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–232.
[Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–232 | total: 397 lines]

172| # =================================================================
173| # EDITABLE: get_action_and_value and compute_losses
174| # =================================================================
175|     def get_action_and_value(self, obs, action=None):
176|         action_mean = self.actor_mean(obs)
177|         action_logstd = self.actor_logstd.expand_as(action_mean)
178|         action_std = torch.exp(action_logstd)
179|         probs = Normal(action_mean, action_std)
180|         if action is None:
181|             action = probs.sample()
182|         return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)
183|
184|
185| def compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):
[... snippet truncated ...]
[Test #3] (0 tests remaining; auto-submitted as final)

## Seed 42

### halfcheetah-v4 (scripts/halfcheetah.sh)

Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=-0.0053 vf_loss=0.1047 entropy=6.3897 approx_kl=0.0009 clipfrac=0.0000
TRAIN_METRICS step=993280 pg_loss=-0.0065 vf_loss=0.0136 entropy=6.3902 approx_kl=0.0006 clipfrac=0.0000
TRAIN_METRICS step=995328 pg_loss=-0.0060 vf_loss=0.0083 entropy=6.3904 approx_kl=0.0005 clipfrac=0.0000
TRAIN_METRICS step=997376 pg_loss=0.0032 vf_loss=0.0075 entropy=6.3901 approx_kl=0.0003 clipfrac=0.0000
TRAIN_METRICS step=999424 pg_loss=-0.0015 vf_loss=0.0132 entropy=6.3903 approx_kl=0.0001 clipfrac=0.0000

Evaluation (halfcheetah-v4):
Eval episodic_return: 1203.89
Eval episodic_return: 1186.70
Eval episodic_return: 1232.93
Final eval return: 1232.93

### swimmer-v4 (scripts/swimmer.sh)

Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=-0.0024 vf_loss=0.0185 entropy=5.1436 approx_kl=0.0004 clipfrac=0.0000
[... output truncated ...]
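The final submission's asymmetric clip reduces to a per-sample choice of upper bound. A scalar sketch (the submitted code does this with `torch.where` over mini-batches; the `clip_coef` value here is illustrative, not CleanRL's default):

```python
def acu_clipped_ratio(ratio, advantage, clip_coef=0.25, pos_mult=2.0):
    # Positive advantages get the widened upper bound 1 + pos_mult * eps;
    # negative advantages keep PPO's symmetric [1 - eps, 1 + eps] clip.
    lo = 1.0 - clip_coef
    hi = 1.0 + clip_coef * (pos_mult if advantage >= 0 else 1.0)
    return min(max(ratio, lo), hi)
```

Note the near-zero `clipfrac` and `approx_kl` in the final run: with the wider upper bound and no value clipping, the policy barely moved per update, which is consistent with the drop from attempt #2's 1708.78 to 1232.93 on halfcheetah-v4.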