Agent Conversation: deepseek-reasoner

Model: deepseek-reasoner
Total steps: 10
Messages: 21
User (Step 0)

Task: rl-onpolicy-continuous

Online RL: On-Policy Actor-Critic for Continuous Control

Objective

Design and implement an on-policy actor-critic RL algorithm for continuous control. Your code goes in custom_onpolicy_continuous.py. Three reference implementations (PPO, RPO, PPO-Penalty) are provided as read-only.

Background

On-policy methods collect trajectories using the current policy, compute advantages via Generalized Advantage Estimation (GAE), and update the policy using mini-batch optimization. Key challenges include sample efficiency, stability of policy updates, and balancing exploration with exploitation. Different approaches address these through clipped surrogate objectives, stochasticity injection, or direct policy gradient estimation.
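The GAE step described above can be sketched in isolation (a minimal illustration with a hypothetical `compute_gae` helper, assuming the common γ=0.99, λ=0.95 defaults; in this task the FIXED training loop already performs this computation):

```python
def compute_gae(rewards, values, next_value, dones, gamma=0.99, gae_lambda=0.95):
    """Backward GAE recursion over one rollout:
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lambda * A_{t+1}
    with the bootstrap zeroed across episode boundaries (dones)."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * v_next * nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * nonterminal * last_gae
        advantages[t] = last_gae
    # returns = advantages + values are the value-function regression targets
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With γ = λ = 1 and zero value estimates this reduces to plain reward-to-go, which is a quick sanity check on the recursion.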

Constraints

  • Network architecture dimensions are FIXED and cannot be modified
  • Total parameter count is enforced at runtime
  • Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
  • Do NOT simply copy a reference implementation with minor changes

Evaluation

Trained and evaluated on HalfCheetah-v4, Hopper-v4, Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).

cleanrl/cleanrl/custom_onpolicy_continuous.py [EDITABLE — lines 175–221 only]

     1: # Custom on-policy continuous RL algorithm for MLS-Bench
     2: #
     3: # FIXED sections: config, env, utilities, network architecture, training loop.
     4: # EDITABLE section: get_action_and_value method and compute_losses function.
     5: import copy
     6: import os
     7: import random
     8: import time
     9: from dataclasses import dataclass
    10: 
    11: import gymnasium as gym
    12: import numpy as np
    13: import torch
    14: import torch.nn as nn
    15: import torch.nn.functional as F
    16: import torch.optim as optim
    17: import tyro
    18: from torch.distributions.normal import Normal
    19: 
    20: 
    21: # =====================================================================
    22: # FIXED: Configuration
    23: # =====================================================================
    24: @dataclass
    25: class Args:
    26:     exp_name: str = os.path.basename(__file__)[: -len(".py")]
    27:     """the name of this experiment"""
    28:     seed: int = 1
    29:     """seed of the experiment"""
    30:     torch_deterministic: bool = True
    31:     """if toggled, `torch.backends.cudnn.deterministic=False`"""
    32:     cuda: bool = True
    33:     """if toggled, cuda will be enabled by default"""
    34: 
    35:     # Algorithm specific arguments
    36:     env_id: str = "HalfCheetah-v4"
    37:     """the id of the environment"""
    38:     total_timesteps: int = 1000000
    39:     """total timesteps of the experiments"""
    40:     learning_rate: float = 3e-4
    41:     """the l
Assistant (Step 1: edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
4  "start_line": 175,
5  "end_line": 221,
6  "content": "    def get_action_and_value(self, obs, action=None):\n        # Squashed Gaussian policy with optional parameter noise for exploration\n        action_mean = self.actor_mean(obs)\n        action_logstd = self.actor_logstd.expand_as(action_mean)\n        action_std = torch.exp(action_logstd)\n        \n        # Add adaptive exploration noise to action mean\n        if not hasattr(self, '_noise_scale'):\n            self._noise_scale = 0.1  # initial noise scale\n        if self.training:\n            # Add noise scaled by current noise scale\n            noise = torch.randn_like(action_mean) * self._noise_scale\n            action_mean = action_mean + noise\n        \n        # Sample from underlying Gaussian\n        normal = Normal(action_mean, action_std)\n        if action is None:\n            raw_action = normal.sample()\n        else:\n            # If action provided (during loss computation), invert tanh\n            raw_action = torch.atanh(torch.clamp(action, -0.999, 0.999))\n        \n        # Apply tanh squashing\n        action = torch.tanh(raw_action)\n        \n        # Compute log probability with change of variables\n        log_prob_raw = normal.log_prob(raw_action)\n        # Jacobian correction: log(1 - tanh^2(raw_action))\n        log_det_jac = torch.log(1 - torch.tanh(raw_action) ** 2 + 1e-8)\n        log_prob = (log_prob_raw - log_det_jac).sum(1)\n        \n        # Compute entropy approximation: H(raw) + E[log_det_jac]\n        entropy_raw = normal.entropy().sum(1)\n        entropy = entropy_raw + log_det_jac.sum(1)\n        \n        return action, log_prob, entropy, self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n    \"\"\"Adaptive Trust Region Policy Optimization with Exploration Noise (ATRPO-EN)\n    \n    Combines clipped surrogate with adaptive KL penalty and adaptive entropy regularization.\n    \"\"\"\n    # Initialize adaptive 
parameters if not present\n    if not hasattr(agent, '_kl_beta'):\n        agent._kl_beta = 0.5\n        agent._target_kl = 0.01\n        agent._entropy_target = -0.5 * mb_actions.shape[1]  # target entropy per dimension (heuristic)\n        agent._entropy_coef = args.ent_coef\n        agent._avg_entropy = 0.0\n        agent._entropy_decay = 0.99\n    \n    _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n    logratio = newlogprob - mb_logprobs\n    ratio = logratio.exp()\n    \n    # KL divergence (with gradient for penalty)\n    kl = ((ratio - 1) - logratio).mean()\n    \n    with torch.no_grad():\n        approx_kl = kl.detach()\n        clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n    \n    # Adaptive clip coefficient based on KL divergence\n    with torch.no_grad():\n        kl_ratio = approx_kl / agent._target_kl\n        adaptive_clip_coef = args.clip_coef * (1.0 + torch.tanh(kl_ratio - 1.0))\n        adaptive_clip_coef = torch.clamp(adaptive_clip_coef, args.clip_coef * 0.5, args.clip_coef * 2.0)\n    \n    # Hybrid policy loss: clipped surrogate + KL penalty\n    pg_loss1 = -mb_advantages * ratio\n    pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - adaptive_clip_coef, 1 + adaptive_clip_coef)\n    pg_loss_clip = torch.max(pg_loss1, pg_loss2).mean()\n    pg_loss = pg_loss_clip + agent._kl_beta * kl\n    \n    # Adapt KL penalty coefficient (similar to PPO-Penalty)\n    with torch.no_grad():\n        if approx_kl > 1.5 * agent._target_kl:\n            agent._kl_beta = min(agent._kl_beta * 2.0, 100.0)\n        elif approx_kl < agent._target_kl / 1.5:\n            agent._kl_beta = max(agent._kl_beta / 2.0, 1e-4)\n    \n    # Update exploration noise scale based on policy improvement\n    with torch.no_grad():\n        # If policy update is too small (ratio close to 1), increase exploration\n        avg_ratio = ratio.mean().item()\n        if avg_ratio > 0.95:\n            agent._noise_scale = 
min(agent._noise_scale * 1.05, 0.5)\n        elif avg_ratio < 0.8:\n            agent._noise_scale = max(agent._noise_scale * 0.95, 0.01)\n    \n    # Adaptive entropy regularization\n    with torch.no_grad():\n        agent._avg_entropy = agent._entropy_decay * agent._avg_entropy + (1 - agent._entropy_decay) * entropy.mean().item()\n        entropy_gap = agent._avg_entropy - agent._entropy_target\n        # Adjust entropy coefficient to drive entropy towards target\n        if entropy_gap < -0.1:  # entropy too low\n            agent._entropy_coef = min(agent._entropy_coef * 1.1, 1.0)\n        elif entropy_gap > 0.1:  # entropy too high\n            agent._entropy_coef = max(agent._entropy_coef * 0.9, 0.001)\n    \n    # Value loss — clipped (as in PPO)\n    newvalue = newvalue.view(-1)\n    if args.clip_vloss:\n        v_loss_unclipped = (newvalue - mb_returns) ** 2\n        v_clipped = mb_values + torch.clamp(\n            newvalue - mb_values,\n            -args.clip_coef,\n            args.clip_coef,\n        )\n        v_loss_clipped = (v_clipped - mb_returns) ** 2\n        v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)\n        v_loss = 0.5 * v_loss_max.mean()\n    else:\n        v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()\n    \n    entropy_loss = entropy.mean()\n    loss = pg_loss - agent._entropy_coef * entropy_loss + v_loss * args.vf_coef\n    \n    return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac"
7}
Tool Result (Step 1)
OK: Replaced lines 175..221 with 120 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–294.

[Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–294 | total: 459 lines]
   172|     # =================================================================
   173|     # EDITABLE: get_action_and_value and compute_losses
   174|     # =================================================================
   175      def get_action_and_value(self, obs, action=None):
   176          # Squashed Gaussian policy with optional parameter noise for exploration
   177          action_mean = self.actor_mean(obs)
   178          action_logstd = self.actor_logstd.expand_as(action_mean)
   179          action_std = torch.exp(action_logstd)
   180          
   181          # Add adaptive exploration noise to action mean
   182          if not hasattr(self, '_noise_scale'):
   183              self._noise_scale = 0.1  # initial noise scale
   184          if self
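The tanh change-of-variables correction used in this edit can be checked on its own (a standalone scalar sketch with a hypothetical `squashed_logprob` helper; mean 0, std 1 assumed, matching none of the fixed code directly):

```python
import math

def squashed_logprob(u, mean=0.0, std=1.0, eps=1e-8):
    """Log-density of a = tanh(u) when u ~ Normal(mean, std):
    log pi(a) = log N(u) - log(1 - tanh(u)^2), per action dimension."""
    log_prob_raw = (
        -0.5 * ((u - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
    )
    # Jacobian of the squashing: |da/du| = 1 - tanh(u)^2 (eps for stability)
    log_det_jac = math.log(1 - math.tanh(u) ** 2 + eps)
    return log_prob_raw - log_det_jac
```

Since log(1 − tanh²(u)) ≤ 0, the corrected log-probability is never below the raw Gaussian one; at u = 0 the correction vanishes.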
Assistant (Step 2: test)
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)


## Seed 42

### halfcheetah-v4 (scripts/halfcheetah.sh)
Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=6.2034 vf_loss=0.0623 entropy=-4.4854 approx_kl=0.0624 clipfrac=0.4275
TRAIN_METRICS step=993280 pg_loss=8.8679 vf_loss=0.0210 entropy=-6.0074 approx_kl=0.0888 clipfrac=0.4227
TRAIN_METRICS step=995328 pg_loss=7.8042 vf_loss=0.0445 entropy=-5.2355 approx_kl=0.0777 clipfrac=0.3513
TRAIN_METRICS step=997376 pg_loss=13.0593 vf_loss=0.0325 entropy=-4.2510 approx_kl=0.1304 clipfrac=0.4118
TRAIN_METRICS step=999424 pg_loss=11.5457 vf_loss=0.0303 entropy=-5.0827 approx_kl=0.1153 clipfrac=0.5952
Evaluation (halfcheetah-v4):
Eval episodic_return: -180.95
Eval episodic_return: -197.75
Eval episodic_return: -149.87
Final eval return: -149.87

### swimmer-v4 (scripts/swimmer.sh)
Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=59.0375 vf_loss=0.0103 entropy=-1.9298 approx_kl=0.5882 clipfrac=0
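The `approx_kl` and `clipfrac` figures in these logs can be reproduced from the per-sample log-ratios alone (a minimal sketch with a hypothetical helper; the training code uses the same `(r − 1) − log r` estimator per mini-batch). Note the approx_kl values here (0.06–0.59) sit far above the 0.01–0.02 range usually targeted by trust-region methods, which is consistent with the collapsed returns and the undo that follows:

```python
import math

def kl_clip_diagnostics(logratios, clip_coef=0.2):
    """approx_kl: mean of (ratio - 1) - logratio (a low-variance KL estimator).
    clipfrac: fraction of ratios falling outside the [1-eps, 1+eps] window."""
    ratios = [math.exp(lr) for lr in logratios]
    approx_kl = sum((r - 1.0) - lr for r, lr in zip(ratios, logratios)) / len(ratios)
    clipfrac = sum(1 for r in ratios if abs(r - 1.0) > clip_coef) / len(ratios)
    return approx_kl, clipfrac
```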
Assistant (Step 3: undo)
Tool Input
1{
2  "n": 1
3}
Tool Result (Step 3)
Undo complete:
Restored: cleanrl/cleanrl/custom_onpolicy_continuous.py
Assistant (Step 4: edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
4  "start_line": 175,
5  "end_line": 221,
6  "content": "    def get_action_and_value(self, obs, action=None):\n        action_mean = self.actor_mean(obs)\n        action_logstd = self.actor_logstd.expand_as(action_mean)\n        action_std = torch.exp(action_logstd)\n        probs = Normal(action_mean, action_std)\n        if action is None:\n            action = probs.sample()\n        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n    \"\"\"Advantage-Weighted Clipped Policy Optimization (AWCPO)\n    \n    Combines advantage-weighted regression with clipped surrogate objective.\n    \"\"\"\n    # Initialize adaptive temperature parameter\n    if not hasattr(agent, '_beta'):\n        agent._beta = 0.1  # temperature for advantage weighting\n        agent._target_weight_std = 0.5  # target standard deviation of normalized weights\n    \n    _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n    logratio = newlogprob - mb_logprobs\n    ratio = logratio.exp()\n    \n    with torch.no_grad():\n        approx_kl = ((ratio - 1) - logratio).mean()\n        clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n    \n    # Compute advantage weights\n    with torch.no_grad():\n        # temperature-scaled advantages\n        scaled_adv = mb_advantages / agent._beta\n        # shift for numerical stability (subtract max)\n        max_scaled = scaled_adv.max()\n        weights = torch.exp(scaled_adv - max_scaled)\n        # normalize weights to have mean 1 (preserve scale of gradient)\n        weights = weights / (weights.mean() + 1e-8)\n        \n        # Compute weight statistics for adaptation\n        weight_std = weights.std()\n        weight_entropy = -(weights * torch.log(weights + 1e-8)).mean()\n        \n        # Adapt temperature beta to control weight dispersion\n        if weight_std > 
agent._target_weight_std * 1.5:\n            agent._beta = agent._beta * 1.05  # increase beta to flatten weights\n        elif weight_std < agent._target_weight_std * 0.5:\n            agent._beta = agent._beta * 0.95  # decrease beta to sharpen weights\n        agent._beta = max(agent._beta, 0.01)\n        agent._beta = min(agent._beta, 10.0)\n    \n    # Weighted clipped surrogate policy loss\n    pg_loss1 = -weights * mb_advantages * ratio\n    pg_loss2 = -weights * mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)\n    pg_loss = torch.max(pg_loss1, pg_loss2).mean()\n    \n    # Add a small penalty on weight entropy to prevent degenerate weights (encourage diversity)\n    weight_entropy_penalty = 0.01 * weight_entropy\n    pg_loss = pg_loss + weight_entropy_penalty\n    \n    # Value loss — clipped (as in PPO)\n    newvalue = newvalue.view(-1)\n    if args.clip_vloss:\n        v_loss_unclipped = (newvalue - mb_returns) ** 2\n        v_clipped = mb_values + torch.clamp(\n            newvalue - mb_values,\n            -args.clip_coef,\n            args.clip_coef,\n        )\n        v_loss_clipped = (v_clipped - mb_returns) ** 2\n        v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)\n        v_loss = 0.5 * v_loss_max.mean()\n    else:\n        v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()\n    \n    entropy_loss = entropy.mean()\n    loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef\n    \n    return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac"
7}
Tool Result (Step 4)
OK: Replaced lines 175..221 with 78 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–252.

[Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–252 | total: 417 lines]
   172|     # =================================================================
   173|     # EDITABLE: get_action_and_value and compute_losses
   174|     # =================================================================
   175      def get_action_and_value(self, obs, action=None):
   176          action_mean = self.actor_mean(obs)
   177          action_logstd = self.actor_logstd.expand_as(action_mean)
   178          action_std = torch.exp(action_logstd)
   179          probs = Normal(action_mean, action_std)
   180          if action is None:
   181              action = probs.sample()
   182          return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)
   183  
   184  
   185  def compute_losses(agent, mb_obs, mb_actions, mb_lo
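The advantage-weighting scheme in this AWCPO edit (truncated in the snippet above) reduces to temperature-scaled exponential weights, max-shifted for numerical stability and normalized to mean one. A standalone sketch with a hypothetical `advantage_weights` helper:

```python
import math

def advantage_weights(advantages, beta=0.2, clip=10.0, eps=1e-8):
    """w_i = exp(clip(A_i) / beta), shifted by the max before exponentiation
    and normalized so the weights average to ~1 (preserving gradient scale)."""
    scaled = [max(-clip, min(clip, a)) / beta for a in advantages]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    w = [math.exp(s - shift) for s in scaled]
    mean_w = sum(w) / len(w)
    return [x / (mean_w + eps) for x in w]
```

Lower `beta` sharpens the weighting toward the best-advantage samples; higher `beta` flattens it toward uniform, which is what the adaptive rule in the edit exploits.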
Assistant (Step 5: test)
Tool Result (Step 5)
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### halfcheetah-v4 (scripts/halfcheetah.sh)
Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=-0.2363 vf_loss=0.0895 entropy=-12.2993 approx_kl=0.0061 clipfrac=0.1118
TRAIN_METRICS step=993280 pg_loss=-0.3140 vf_loss=0.0103 entropy=-12.2995 approx_kl=0.0063 clipfrac=0.0729
TRAIN_METRICS step=995328 pg_loss=-0.2727 vf_loss=0.0055 entropy=-12.2996 approx_kl=0.0047 clipfrac=0.0493
TRAIN_METRICS step=997376 pg_loss=-0.4571 vf_loss=0.0035 entropy=-12.2997 approx_kl=0.0046 clipfrac=0.0254
TRAIN_METRICS step=999424 pg_loss=-0.2748 vf_loss=0.0083 entropy=-12.2998 approx_kl=0.0027 clipfrac=0.0406
Evaluation (halfcheetah-v4):
Eval episodic_return: 1403.91
Eval episodic_return: 1406.55
Eval episodic_return: 1405.23
Final eval return: 1405.23

### swimmer-v4 (scripts/swimmer.sh)
Training metrics (last steps):
TRAIN_METRICS step=991232 pg_loss=-0.2097 vf_loss=0.0052 entropy=-7.2274 approx_kl=0.0069 clipfrac=0.1017

Assistant (Step 6: edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
4  "start_line": 175,
5  "end_line": 221,
6  "content": "    def get_action_and_value(self, obs, action=None):\n        action_mean = self.actor_mean(obs)\n        action_logstd = self.actor_logstd.expand_as(action_mean)\n        action_std = torch.exp(action_logstd)\n        probs = Normal(action_mean, action_std)\n        if action is None:\n            action = probs.sample()\n        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n    \"\"\"Advantage-Weighted Clipped Policy Optimization (AWCPO) with enhanced exploration.\n    \n    Combines advantage-weighted regression with clipped surrogate objective.\n    Adds entropy regularization and adaptive weight tuning.\n    \"\"\"\n    # Initialize adaptive parameters\n    if not hasattr(agent, '_beta'):\n        agent._beta = 0.2  # temperature for advantage weighting\n        agent._target_weight_std = 0.3  # target standard deviation of normalized weights\n        agent._entropy_coef = 0.01  # additional entropy coefficient (added to args.ent_coef)\n        agent._target_entropy = -0.5 * mb_actions.shape[1]  # heuristic target entropy per dimension\n    \n    _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n    logratio = newlogprob - mb_logprobs\n    ratio = logratio.exp()\n    \n    with torch.no_grad():\n        approx_kl = ((ratio - 1) - logratio).mean()\n        clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n    \n    # Clip advantages for numerical stability\n    clipped_advantages = torch.clamp(mb_advantages, -10.0, 10.0)\n    \n    # Compute advantage weights\n    with torch.no_grad():\n        # temperature-scaled advantages\n        scaled_adv = clipped_advantages / agent._beta\n        # shift for numerical stability (subtract max)\n        max_scaled = scaled_adv.max()\n        weights = torch.exp(scaled_adv - max_scaled)\n        # 
normalize weights to have mean 1 (preserve scale of gradient)\n        weights = weights / (weights.mean() + 1e-8)\n        \n        # Compute weight statistics for adaptation\n        weight_std = weights.std()\n        weight_entropy = -(weights * torch.log(weights + 1e-8)).mean()\n        \n        # Adapt temperature beta to control weight dispersion\n        if weight_std > agent._target_weight_std * 1.5:\n            agent._beta = agent._beta * 1.05  # increase beta to flatten weights\n        elif weight_std < agent._target_weight_std * 0.5:\n            agent._beta = agent._beta * 0.95  # decrease beta to sharpen weights\n        agent._beta = max(agent._beta, 0.05)\n        agent._beta = min(agent._beta, 5.0)\n        \n        # Adapt entropy coefficient to maintain target entropy\n        entropy_mean = entropy.mean().item()\n        if entropy_mean < agent._target_entropy - 0.2:\n            agent._entropy_coef = min(agent._entropy_coef * 1.1, 0.1)\n        elif entropy_mean > agent._target_entropy + 0.2:\n            agent._entropy_coef = max(agent._entropy_coef * 0.9, 0.001)\n    \n    # Weighted clipped surrogate policy loss\n    pg_loss1 = -weights * clipped_advantages * ratio\n    pg_loss2 = -weights * clipped_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)\n    pg_loss = torch.max(pg_loss1, pg_loss2).mean()\n    \n    # Add a small penalty on weight entropy to prevent degenerate weights (encourage diversity)\n    weight_entropy_penalty = 0.005 * weight_entropy\n    pg_loss = pg_loss + weight_entropy_penalty\n    \n    # Value loss — clipped (as in PPO)\n    newvalue = newvalue.view(-1)\n    if args.clip_vloss:\n        v_loss_unclipped = (newvalue - mb_returns) ** 2\n        v_clipped = mb_values + torch.clamp(\n            newvalue - mb_values,\n            -args.clip_coef,\n            args.clip_coef,\n        )\n        v_loss_clipped = (v_clipped - mb_returns) ** 2\n        v_loss_max = torch.max(v_loss_unclipped, 
v_loss_clipped)\n        v_loss = 0.5 * v_loss_max.mean()\n    else:\n        v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()\n    \n    entropy_loss = entropy.mean()\n    # Total entropy coefficient = original + adaptive\n    total_ent_coef = args.ent_coef + agent._entropy_coef\n    loss = pg_loss - total_ent_coef * entropy_loss + v_loss * args.vf_coef\n    \n    return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac"
7}
Tool Result (Step 6)
OK: Replaced lines 175..221 with 93 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–298.

[Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–298 | total: 463 lines]
   172|     # =================================================================
   173|     # EDITABLE: get_action_and_value and compute_losses
   174|     # =================================================================
   175      def get_action_and_value(self, obs, action=None):
   176          action_mean = self.actor_mean(obs)
   177          action_logstd = self.actor_logstd.expand_as(action_mean)
   178          action_std = torch.exp(action_logstd)
   179          probs = Normal(action_mean, action_std)
   180          if action is None:
   181              action = probs.sample()
   182          return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)
   183  
   184  
   185  def compute_losses(agent, mb_obs, mb_actions, mb_lo
Assistant (Step 7: undo)
Tool Input
1{
2  "n": 1
3}
Tool Result (Step 7)
Undo complete:
Restored: cleanrl/cleanrl/custom_onpolicy_continuous.py
Assistant (Step 8: edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
4  "start_line": 175,
5  "end_line": 221,
6  "content": "    def get_action_and_value(self, obs, action=None):\n        action_mean = self.actor_mean(obs)\n        action_logstd = self.actor_logstd.expand_as(action_mean)\n        action_std = torch.exp(action_logstd)\n        probs = Normal(action_mean, action_std)\n        if action is None:\n            action = probs.sample()\n        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n    \"\"\"Relative Policy Optimization (RPO): advantage-weighted clipped surrogate with adaptive trust region.\"\"\"\n    # Initialize adaptive parameters\n    if not hasattr(agent, '_beta'):\n        agent._beta = 0.2          # temperature for advantage weighting\n        agent._kl_beta = 0.5       # KL penalty coefficient\n        agent._target_kl = 0.01    # target KL divergence\n    \n    _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n    logratio = newlogprob - mb_logprobs\n    ratio = logratio.exp()\n    \n    # KL divergence (with gradient for penalty)\n    kl = ((ratio - 1) - logratio).mean()\n    \n    with torch.no_grad():\n        approx_kl = kl.detach()\n        clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n    \n    # Advantage weighting\n    with torch.no_grad():\n        # clip advantages to avoid extreme weights\n        clipped_adv = torch.clamp(mb_advantages, -10.0, 10.0)\n        weights = torch.exp(clipped_adv / agent._beta)\n        weights = weights / (weights.mean() + 1e-8)\n        # adapt temperature beta to keep weight std ~0.5\n        weight_std = weights.std()\n        if weight_std > 0.7:\n            agent._beta = min(agent._beta * 1.05, 5.0)\n        elif weight_std < 0.3:\n            agent._beta = max(agent._beta * 0.95, 0.05)\n    \n    # Weighted clipped surrogate policy loss + KL penalty\n    pg_loss1 = -weights * mb_advantages * 
ratio\n    pg_loss2 = -weights * mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)\n    pg_loss_clip = torch.max(pg_loss1, pg_loss2).mean()\n    pg_loss = pg_loss_clip + agent._kl_beta * kl\n    \n    # Adapt KL penalty coefficient\n    with torch.no_grad():\n        if approx_kl > 1.5 * agent._target_kl:\n            agent._kl_beta = min(agent._kl_beta * 2.0, 100.0)\n        elif approx_kl < agent._target_kl / 1.5:\n            agent._kl_beta = max(agent._kl_beta / 2.0, 1e-4)\n    \n    # Value loss — clipped (as in PPO)\n    newvalue = newvalue.view(-1)\n    if args.clip_vloss:\n        v_loss_unclipped = (newvalue - mb_returns) ** 2\n        v_clipped = mb_values + torch.clamp(\n            newvalue - mb_values,\n            -args.clip_coef,\n            args.clip_coef,\n        )\n        v_loss_clipped = (v_clipped - mb_returns) ** 2\n        v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)\n        v_loss = 0.5 * v_loss_max.mean()\n    else:\n        v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()\n    \n    # Entropy bonus (fixed coefficient 0.01) added to default args.ent_coef\n    entropy_loss = entropy.mean()\n    total_ent_coef = args.ent_coef + 0.01\n    loss = pg_loss - total_ent_coef * entropy_loss + v_loss * args.vf_coef\n    \n    return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac"
7}
Tool Result (Step 8)
OK: Replaced lines 175..221 with 76 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–281.

[Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–281 | total: 446 lines]
   172|     # =================================================================
   173|     # EDITABLE: get_action_and_value and compute_losses
   174|     # =================================================================
   175      def get_action_and_value(self, obs, action=None):
   176          action_mean = self.actor_mean(obs)
   177          action_logstd = self.actor_logstd.expand_as(action_mean)
   178          action_std = torch.exp(action_logstd)
   179          probs = Normal(action_mean, action_std)
   180          if action is None:
   181              action = probs.sample()
   182          return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)
   183  
   184  
   185  def compute_losses(agent, mb_obs, mb_actions, mb_lo
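The adaptive KL coefficient in this edit follows the standard PPO-Penalty doubling/halving rule. Isolated as a hypothetical `adapt_kl_beta` helper (constants taken from the edit above):

```python
def adapt_kl_beta(kl_beta, approx_kl, target_kl=0.01, lo=1e-4, hi=100.0):
    """Double the penalty when the observed KL overshoots the target by 1.5x,
    halve it when the KL undershoots by the same factor, clamped to [lo, hi]."""
    if approx_kl > 1.5 * target_kl:
        return min(kl_beta * 2.0, hi)
    if approx_kl < target_kl / 1.5:
        return max(kl_beta / 2.0, lo)
    return kl_beta
```

The 1.5x dead-band keeps the coefficient stable when the KL is already near target, so it only reacts to clear over- or under-stepping.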
Assistant (Step 9: edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "cleanrl/cleanrl/custom_onpolicy_continuous.py",
4  "start_line": 175,
5  "end_line": 281,
6  "content": "    def get_action_and_value(self, obs, action=None):\n        action_mean = self.actor_mean(obs)\n        action_logstd = self.actor_logstd.expand_as(action_mean)\n        action_std = torch.exp(action_logstd)\n        probs = Normal(action_mean, action_std)\n        if action is None:\n            action = probs.sample()\n        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)\n\n\ndef compute_losses(agent, mb_obs, mb_actions, mb_logprobs, mb_advantages, mb_returns, mb_values, args):\n    \"\"\"Relative Policy Optimization (RPO): advantage-weighted clipped surrogate with adaptive trust region.\"\"\"\n    # Initialize adaptive parameters\n    if not hasattr(agent, '_beta'):\n        agent._beta = 0.2          # temperature for advantage weighting\n        agent._kl_beta = 0.5       # KL penalty coefficient\n        agent._target_kl = 0.01    # target KL divergence\n    \n    _, newlogprob, entropy, newvalue = agent.get_action_and_value(mb_obs, mb_actions)\n    logratio = newlogprob - mb_logprobs\n    ratio = logratio.exp()\n    \n    # KL divergence (with gradient for penalty)\n    kl = ((ratio - 1) - logratio).mean()\n    \n    with torch.no_grad():\n        approx_kl = kl.detach()\n        clipfrac = ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()\n    \n    # Advantage weighting\n    with torch.no_grad():\n        # clip advantages to avoid extreme weights\n        clipped_adv = torch.clamp(mb_advantages, -10.0, 10.0)\n        weights = torch.exp(clipped_adv / agent._beta)\n        weights = weights / (weights.mean() + 1e-8)\n        # adapt temperature beta to keep weight std ~0.5\n        weight_std = weights.std()\n        if weight_std > 0.7:\n            agent._beta = min(agent._beta * 1.05, 5.0)\n        elif weight_std < 0.3:\n            agent._beta = max(agent._beta * 0.95, 0.05)\n    \n    # Weighted clipped surrogate policy loss + KL penalty\n    pg_loss1 = -weights * mb_advantages * 
ratio\n    pg_loss2 = -weights * mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)\n    pg_loss_clip = torch.max(pg_loss1, pg_loss2).mean()\n    pg_loss = pg_loss_clip + agent._kl_beta * kl\n    \n    # Adapt KL penalty coefficient\n    with torch.no_grad():\n        if approx_kl > 1.5 * agent._target_kl:\n            agent._kl_beta = min(agent._kl_beta * 2.0, 100.0)\n        elif approx_kl < agent._target_kl / 1.5:\n            agent._kl_beta = max(agent._kl_beta / 2.0, 1e-4)\n    \n    # Value loss — clipped (as in PPO)\n    newvalue = newvalue.view(-1)\n    if args.clip_vloss:\n        v_loss_unclipped = (newvalue - mb_returns) ** 2\n        v_clipped = mb_values + torch.clamp(\n            newvalue - mb_values,\n            -args.clip_coef,\n            args.clip_coef,\n        )\n        v_loss_clipped = (v_clipped - mb_returns) ** 2\n        v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)\n        v_loss = 0.5 * v_loss_max.mean()\n    else:\n        v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()\n    \n    # Entropy bonus (fixed coefficient 0.01) added to default args.ent_coef\n    entropy_loss = entropy.mean()\n    total_ent_coef = args.ent_coef + 0.01\n    loss = pg_loss - total_ent_coef * entropy_loss + v_loss * args.vf_coef\n    \n    return loss, pg_loss, v_loss, entropy_loss, approx_kl, clipfrac"
7}
Tool Result (Step 9)
OK: Replaced lines 175..281 with 76 line(s) in cleanrl/cleanrl/custom_onpolicy_continuous.py. Editable range: 175–250.

[Current file: cleanrl/cleanrl/custom_onpolicy_continuous.py | editable: 175–250 | total: 415 lines]
   172|     # =================================================================
   173|     # EDITABLE: get_action_and_value and compute_losses
   174|     # =================================================================
   175      def get_action_and_value(self, obs, action=None):
   176          action_mean = self.actor_mean(obs)
   177          action_logstd = self.actor_logstd.expand_as(action_mean)
   178          action_std = torch.exp(action_logstd)
   179          probs = Normal(action_mean, action_std)
   180          if action is None:
   181              action = probs.sample()
   182          return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(obs)
   183  
   184  
   185  def compute_losses(agent, mb_obs, mb_actions, mb_lo
Assistant (Step 10: test)