Agent Conversation: google/gemini-3.1-pro-preview
Task: rl-onpolicy-continuous
Online RL: On-Policy Actor-Critic for Continuous Control
Objective
Design and implement an on-policy actor-critic RL algorithm for continuous control. Your code goes in custom_onpolicy_continuous.py. Three reference implementations (PPO, RPO, PPO-Penalty) are provided as read-only.
Background
On-policy methods collect trajectories using the current policy, compute advantages via Generalized Advantage Estimation (GAE), and update the policy using mini-batch optimization. Key challenges include sample efficiency, stability of policy updates, and balancing exploration with exploitation. Different approaches address these through clipped surrogate objectives, stochasticity injection, or direct policy gradient estimation.
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
Evaluation
Trained and evaluated on HalfCheetah-v4, Hopper-v4, Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
cleanrl/cleanrl/custom_onpolicy_continuous.py [EDITABLE — lines 175–221 only]
1: # Custom on-policy continuous RL algorithm for MLS-Bench
2: #
3: # FIXED sections: config, env, utilities, network architecture, training loop.
4: # EDITABLE section: get_action_and_value method and compute_losses function.
5: import copy
6: import os
7: import random
8: import time
9: from dataclasses import dataclass
10:
11: import gymnasium as gym
12: import numpy as np
13: import torch
14: import torch.nn as nn
15: import torch.nn.functional as F
16: import torch.optim as optim
17: import tyro
18: from torch.distributions.normal import Normal
19:
20:
21: # =====================================================================
22: # FIXED: Configuration
23: # =====================================================================
24: @dataclass
25: class Args:
26: exp_name: str = os.path.basename(__file__)[: -len(".py")]
27: """the name of this experiment"""
28: seed: int = 1
29: """seed of the experiment"""
30: torch_deterministic: bool = True
31: """if toggled, `torch.backends.cudnn.deterministic=False`"""
32: cuda: bool = True
33: """if toggled, cuda will be enabled by default"""
34:
35: # Algorithm specific arguments
36: env_id: str = "HalfCheetah-v4"
37: """the id of the environment"""
38: total_timesteps: int = 1000000
39: """total timesteps of the experiments"""
40: learning_rate: float = 3e-4
41: """the lTool Input
1{
2 "start_line": 175,
3 "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
4 "content": " def get_action_and_value(self, obs, action=None):\n action_mean = self.actor_mean(obs)\n action_logstd = self.actor_logstd.expand_as(action_mean)\n action_std = torch.exp(action_logstd)\n probs = Normal(action_mean, action_std)\n\n if action is None:\n u = probs.sample()\n action_out = torch.tanh(u)\n log_prob = probs.log_prob(u) - torch.log(1 - action_out.pow(2) + 1e-6)\n return action_out, log_prob.sum(1), probs.entropy().sum(1), self.critic(obs)\n else:\n eps = 1e-6\n action_clamped = torch.clamp(action, -1.0 + eps, 1.0 - eps)\n u = 0.5 * torch.log((1 + action_clamped) / (1 - action_clamped))\n log_prob = probs.log_prob(u) - torch.log(1 - action_clamped.pow(2) + eps)\n return action, log_prob.sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n \"\"\"Dual-Clipped PPO with Squashed Normal and Huber Value Loss.\"\"\"\n _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n logratio = newlogprob - mb_logprobs\n ratio = logratio.exp()\n\n with torch.no_grad():\n approx_kl = ((ratio - 1) - logratio).mean()\n clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n\n # Policy loss: Dual Clipping\n pg_loss1 = -mb_advantages * ratio\n pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)\n pg_loss_ppo = torch.max(pg_loss1, pg_loss2)\n\n dual_clip_coef = 3.0\n pg_loss_dual = torch.where(\n mb_advantages < 0,\n torch.min(pg_loss_ppo, -mb_advantages * dual_clip_coef),\n pg_loss_ppo\n )\n pg_loss = pg_loss_dual.mean()\n\n # Value loss: Huber Loss\n newvalue = newvalue.view(-1)\n v_loss = F.huber_loss(newvalue, mb_returns, delta=2.0)\n\n entropy_loss = entropy.mean()\n ent_c = args.ent_coef if args.ent_coef > 0.0 else 0.005\n loss = pg_loss - ent_c * entropy_loss + v_loss * args.vf_coef\n\n return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac",
5 "end_line": 221,
6 "op": "replace"
7}OK: Replaced lines 175..221 with 51 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–225. [Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–225 | total: 390 lines] 172| # ================================================================= 173| # EDITABLE: get_action_and_value and compute_losses 174| # ================================================================= 175 def get_action_and_value(self, obs, action=None): 176 action_mean = self.actor_mean(obs) 177 action_logstd = self.actor_logstd.expand_as(action_mean) 178 action_std = torch.exp(action_logstd) 179 probs = Normal(action_mean, action_std) 180 181 if action is None: 182 u = probs.sample() 183 action_out = torch.tanh(u) 184 log_prob = probs.log_prob(u) - torch.log(1 - action_out.pow(2) + 1e-6) 185 return action_out,
[Test #1] (2 tests remaining; call test(submit=1) to submit this result) ## Seed 42 ### halfcheetah-v4 (scripts/halfcheetah.sh) Training metrics (last steps): TRAIN_METRICS step=991232 pg_loss=-0.0146 vf_loss=0.0929 entropy=6.7939 approx_kl=0.0017 clipfrac=0.0001 TRAIN_METRICS step=993280 pg_loss=-0.0061 vf_loss=0.0133 entropy=6.7941 approx_kl=0.0015 clipfrac=0.0002 TRAIN_METRICS step=995328 pg_loss=-0.0079 vf_loss=0.0074 entropy=6.7945 approx_kl=0.0009 clipfrac=0.0000 TRAIN_METRICS step=997376 pg_loss=-0.0003 vf_loss=0.0061 entropy=6.7943 approx_kl=0.0001 clipfrac=0.0000 TRAIN_METRICS step=999424 pg_loss=0.0008 vf_loss=0.0110 entropy=6.7945 approx_kl=0.0000 clipfrac=0.0000 Evaluation (halfcheetah-v4): Eval episodic_return: 1352.09 Eval episodic_return: 1334.14 Eval episodic_return: 1409.70 Final eval return: 1409.70 ### swimmer-v4 (scripts/swimmer.sh) Training metrics (last steps): TRAIN_METRICS step=991232 pg_loss=0.0010 vf_loss=0.0193 entropy=3.6491 approx_kl=0.0030 clipfrac=0.0
Tool Input
1{
2 "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
3 "end_line": 225,
4 "op": "replace",
5 "start_line": 175,
6 "content": " def get_action_and_value(self, obs, action=None):\n action_mean = self.actor_mean(obs)\n action_logstd = self.actor_logstd.expand_as(action_mean)\n action_std = torch.exp(action_logstd)\n probs = Normal(action_mean, action_std)\n\n if action is None:\n action = probs.sample()\n return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n \"\"\"Dual-Clipped PPO with Huber Value Loss.\"\"\"\n _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n logratio = newlogprob - mb_logprobs\n ratio = logratio.exp()\n\n with torch.no_grad():\n approx_kl = ((ratio - 1) - logratio).mean()\n clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n\n # Policy loss: Dual Clipping\n pg_loss1 = -mb_advantages * ratio\n pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)\n pg_loss_ppo = torch.max(pg_loss1, pg_loss2)\n\n dual_clip_coef = 3.0\n # Dual clipping bounds the maximum penalty for taking a bad action\n pg_loss_dual = torch.where(\n mb_advantages < 0,\n torch.min(pg_loss_ppo, -mb_advantages * dual_clip_coef),\n pg_loss_ppo\n )\n pg_loss = pg_loss_dual.mean()\n\n # Value loss: Huber Loss\n newvalue = newvalue.view(-1)\n v_loss = F.huber_loss(newvalue, mb_returns, delta=2.0)\n\n entropy_loss = entropy.mean()\n loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef\n\n return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac"
7}OK: Replaced lines 175..225 with 43 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–217. [Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–217 | total: 382 lines] 172| # ================================================================= 173| # EDITABLE: get_action_and_value and compute_losses 174| # ================================================================= 175 def get_action_and_value(self, obs, action=None): 176 action_mean = self.actor_mean(obs) 177 action_logstd = self.actor_logstd.expand_as(action_mean) 178 action_std = torch.exp(action_logstd) 179 probs = Normal(action_mean, action_std) 180 181 if action is None: 182 action = probs.sample() 183 return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs) 184 185 186 def compute_losses(agent, mb_obs, mb_actio
[Test #2] (1 test remaining; call test(submit=2) to submit this result) ### halfcheetah-v4 (scripts/halfcheetah.sh) Training metrics (last steps): TRAIN_METRICS step=991232 pg_loss=-0.0287 vf_loss=0.0944 entropy=4.3976 approx_kl=0.0031 clipfrac=0.0073 TRAIN_METRICS step=993280 pg_loss=-0.0089 vf_loss=0.0116 entropy=4.3978 approx_kl=0.0014 clipfrac=0.0002 TRAIN_METRICS step=995328 pg_loss=-0.0007 vf_loss=0.0087 entropy=4.3980 approx_kl=0.0009 clipfrac=0.0000 TRAIN_METRICS step=997376 pg_loss=-0.0012 vf_loss=0.0074 entropy=4.3979 approx_kl=0.0004 clipfrac=0.0000 TRAIN_METRICS step=999424 pg_loss=-0.0010 vf_loss=0.0088 entropy=4.3980 approx_kl=0.0001 clipfrac=0.0000 Evaluation (halfcheetah-v4): Eval episodic_return: 1307.42 Eval episodic_return: 1323.09 Eval episodic_return: 1361.87 Final eval return: 1361.87 ### swimmer-v4 (scripts/swimmer.sh) Training metrics (last steps): TRAIN_METRICS step=991232 pg_loss=-0.0000 vf_loss=0.0125 entropy=2.7351 approx_kl=0.0004 clipfrac=0.0000 TRAIN_ME
Tool Input
1{
2 "content": " def get_action_and_value(self, obs, action=None):\n action_mean = self.actor_mean(obs)\n action_logstd = self.actor_logstd.expand_as(action_mean)\n action_std = torch.exp(action_logstd)\n probs = Normal(action_mean, action_std)\n\n if action is None:\n action = probs.sample()\n return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n \"\"\"Dual-Clipped PPO with Huber Value Loss and Centered Advantages.\"\"\"\n _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n logratio = newlogprob - mb_logprobs\n ratio = logratio.exp()\n\n with torch.no_grad():\n approx_kl = ((ratio - 1) - logratio).mean()\n clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n\n # Policy loss: Dual Clipping\n pg_loss1 = -mb_advantages * ratio\n pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)\n pg_loss_ppo = torch.max(pg_loss1, pg_loss2)\n\n dual_clip_coef = 3.0\n # Dual clipping bounds the maximum penalty for taking a bad action\n pg_loss_dual = torch.where(\n mb_advantages < 0,\n torch.min(pg_loss_ppo, -mb_advantages * dual_clip_coef),\n pg_loss_ppo\n )\n pg_loss = pg_loss_dual.mean()\n\n # Value loss: Huber Loss\n newvalue = newvalue.view(-1)\n v_loss = F.huber_loss(newvalue, mb_returns, delta=2.0)\n\n # Added exploration bonus for tasks that prematurely converge\n entropy_loss = entropy.mean()\n loss = pg_loss - 0.005 * entropy_loss + v_loss * args.vf_coef\n\n return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac",
3 "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
4 "op": "replace",
5 "start_line": 175,
6 "end_line": 217
7}OK: Replaced lines 175..217 with 44 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–218. [Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–218 | total: 383 lines] 172| # ================================================================= 173| # EDITABLE: get_action_and_value and compute_losses 174| # ================================================================= 175 def get_action_and_value(self, obs, action=None): 176 action_mean = self.actor_mean(obs) 177 action_logstd = self.actor_logstd.expand_as(action_mean) 178 action_std = torch.exp(action_logstd) 179 probs = Normal(action_mean, action_std) 180 181 if action is None: 182 action = probs.sample() 183 return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs) 184 185 186 def compute_losses(agent, mb_obs, mb_actio
[Test #3] (0 tests remaining; auto-submitted as final) ## Seed 42 ### halfcheetah-v4 (scripts/halfcheetah.sh) Training metrics (last steps): TRAIN_METRICS step=991232 pg_loss=-0.0030 vf_loss=0.0937 entropy=7.5974 approx_kl=0.0017 clipfrac=0.0012 TRAIN_METRICS step=993280 pg_loss=-0.0056 vf_loss=0.0175 entropy=7.5980 approx_kl=0.0008 clipfrac=0.0000 TRAIN_METRICS step=995328 pg_loss=0.0001 vf_loss=0.0072 entropy=7.5984 approx_kl=0.0003 clipfrac=0.0000 TRAIN_METRICS step=997376 pg_loss=-0.0032 vf_loss=0.0074 entropy=7.5983 approx_kl=0.0001 clipfrac=0.0000 TRAIN_METRICS step=999424 pg_loss=-0.0010 vf_loss=0.0127 entropy=7.5984 approx_kl=0.0000 clipfrac=0.0000 Evaluation (halfcheetah-v4): Eval episodic_return: 1171.42 Eval episodic_return: 1176.75 Eval episodic_return: 1174.94 Final eval return: 1174.94 ### swimmer-v4 (scripts/swimmer.sh) Training metrics (last steps): TRAIN_METRICS step=991232 pg_loss=-0.0019 vf_loss=0.0145 entropy=3.6457 approx_kl=0.0004 clipfrac=0.0000 TRAIN_METRICS