Agent Conversation: anthropic/claude-opus-4.6
Task: rl-offline-adroit
Offline RL: Dexterous Manipulation with Narrow Expert Data (Adroit)
Objective
Design and implement an offline RL algorithm for high-dimensional dexterous manipulation from narrow human demonstration data (~25 demos). Your code goes in custom_adroit.py. Three reference implementations (BC-10%, AWAC, ReBRAC) are provided as read-only.
Background
Adroit tasks involve a 24-DoF robotic hand with high-dimensional action spaces (24-30 dims). The human-v1 datasets are narrow, containing only ~25 human teleoperation demonstrations, which creates far more severe distribution shift than in the locomotion tasks.
Constraints
- Network dimensions are fixed at 256. All MLP hidden layers must use 256 units. A _mlp() factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256 (see the sketch after this list).
- Total parameter count is enforced. The training loop checks that total trainable parameters do not exceed 1.2x the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity.
- Do NOT simply copy a reference implementation with minor changes
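As a rough illustration of the width and parameter-budget constraints, here is a minimal sketch of a 256-wide MLP factory and a parameter-count check. The signatures are assumptions for illustration; the actual _mlp() helper and budget check live in the FIXED section and are not shown above.

```python
import torch.nn as nn

def _mlp(in_dim: int, out_dim: int, n_hidden: int = 2) -> nn.Sequential:
    # Hypothetical factory: every hidden layer is fixed at 256 units.
    layers, width = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(width, 256), nn.ReLU()]
        width = 256
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)

def count_trainable(*modules: nn.Module) -> int:
    # How a budget check (<= 1.2x the largest baseline) would count parameters.
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)
```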
Evaluation
Policies are trained and evaluated on Pen (rotation), Door (opening), and Hammer (nailing) using the Adroit human-v1 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert).
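For reference, the D4RL normalized score is a linear rescaling of the raw episode return against per-environment random and expert reference returns; a minimal sketch of the formula (the reference returns come from the d4rl dataset metadata and are not reproduced here):

```python
def d4rl_normalized_score(raw_return: float, random_return: float, expert_return: float) -> float:
    # 0 corresponds to the random-policy return, 100 to the expert reference return.
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)
```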
CORL/algorithms/offline/custom_adroit.py [EDITABLE — lines 214–416 only]
1: # Custom offline RL algorithm for MLS-Bench — Adroit dexterous manipulation
2: #
3: # EDITABLE section: network definitions + OfflineAlgorithm class.
4: # FIXED sections: everything else (config, utilities, data, eval, training loop).
5: import os
6: import random
7: import uuid
8: from copy import deepcopy
9: from dataclasses import dataclass
10: from typing import Any, Dict, List, Optional, Tuple, Union
11:
12: import d4rl
13: import gym
14: import numpy as np
15: import pyrallis
16: import torch
17: import torch.nn as nn
18: import torch.nn.functional as F
19: from torch.distributions import Normal, TanhTransform, TransformedDistribution
20:
21: TensorBatch = List[torch.Tensor]
22:
23:
24: # =====================================================================
25: # FIXED: Configuration
26: # batch_size, eval_freq, n_episodes, max_timesteps are enforced here.
27: # =====================================================================
28: @dataclass
29: class TrainConfig:
30: device: str = "cuda"
31: env: str = "pen-human-v1"
32: seed: int = 0
33: eval_freq: int = int(5e3)
34: n_episodes: int = 10
35: max_timesteps: int = int(1e6)
36: checkpoints_path: Optional[str] = None
37: buffer_size: int
Tool Input
1{
2 "op": "replace",
3 "filename": "CORL/algorithms/offline/custom_adroit.py",
4 "start_line": 214,
5 "end_line": 416,
6 "content": "CONFIG_OVERRIDES: Dict[str, Any] = {}\n\n\nclass DeterministicActor(nn.Module):\n \"\"\"Deterministic policy pi(s) = tanh(net(s)) * max_action.\"\"\"\n def __init__(self, state_dim: int, action_dim: int, max_action: float):\n super().__init__()\n self.max_action = max_action\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, action_dim), nn.Tanh(),\n )\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.max_action * self.net(state)\n @torch.no_grad()\n def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n return self(state).cpu().data.numpy().flatten()\n\n\nclass Actor(nn.Module):\n \"\"\"Gaussian policy with dropout and state-independent log_std for IQL.\"\"\"\n def __init__(self, state_dim: int, action_dim: int, max_action: float,\n orthogonal_init: bool = False):\n super().__init__()\n self.max_action = max_action\n self.action_dim = action_dim\n self._mlp = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(), nn.Dropout(0.1),\n nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.1),\n nn.Linear(256, action_dim), nn.Tanh(),\n )\n self._log_std = nn.Parameter(torch.zeros(action_dim, dtype=torch.float32))\n self._min_log_std = -20.0\n self._max_log_std = 2.0\n def _get_policy(self, state: torch.Tensor):\n mean = self._mlp(state)\n log_std = self._log_std.clamp(self._min_log_std, self._max_log_std)\n return Normal(mean, log_std.exp())\n def log_prob(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:\n action = torch.clamp(action / self.max_action, -1.0 + 1e-6, 1.0 - 1e-6)\n return self._get_policy(state).log_prob(action).sum(-1)\n def forward(self, state: torch.Tensor, deterministic: bool = False):\n policy = self._get_policy(state)\n action = policy.mean if deterministic else policy.rsample()\n action = torch.clamp(action, -1.0, 1.0)\n return self.max_action * action, policy.log_prob(action).sum(-1)\n @torch.no_grad()\n def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n policy = self._get_policy(state)\n action = policy.sample() if self._mlp.training else policy.mean\n action = torch.clamp(self.max_action * action, -self.max_action, self.max_action)\n return action[0].cpu().numpy()\n\n\nclass Critic(nn.Module):\n \"\"\"Twin Q-function. Two 3x256 MLPs.\"\"\"\n def __init__(self, state_dim: int, action_dim: int, orthogonal_init: bool = False):\n super().__init__()\n self.q1 = nn.Sequential(\n nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n self.q2 = nn.Sequential(\n nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n def both(self, state: torch.Tensor, action: torch.Tensor):\n sa = torch.cat([state, action], dim=-1)\n return self.q1(sa).squeeze(-1), self.q2(sa).squeeze(-1)\n def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:\n return torch.min(*self.both(state, action))\n\n\nclass ValueFunction(nn.Module):\n \"\"\"V(s). 
3x256 MLP.\"\"\"\n def __init__(self, state_dim: int, orthogonal_init: bool = False):\n super().__init__()\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n init_module_weights(self.net, orthogonal_init)\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.net(state).squeeze(-1)\n\n\nclass OfflineAlgorithm:\n \"\"\"Calibrated Advantage IQL (CA-IQL): IQL enhanced with per-batch\n advantage whitening and cosine-annealed temperature for robust\n policy extraction from narrow offline data.\n\n Innovations over standard IQL:\n 1. Per-batch advantage whitening: normalizes A = Q(s,a) - V(s) to\n zero mean / unit variance within each mini-batch before exp\n weighting, making beta invariant to Q-function scale across envs.\n 2. Cosine temperature annealing: beta from beta_min->beta_max gives\n curriculum from BC-like (uniform weights, robust when Q unreliable)\n to selective extraction (focus on high-advantage actions).\n 3. Critic weight decay: L2 regularization prevents Q/V overfitting\n on narrow datasets (~25 demonstrations).\n \"\"\"\n def __init__(self, state_dim: int, action_dim: int, max_action: float,\n replay_buffer=None, discount: float = 0.99, tau: float = 5e-3,\n actor_lr: float = 3e-4, critic_lr: float = 3e-4,\n alpha_lr: float = 3e-4, orthogonal_init: bool = True,\n device: str = \"cuda\"):\n self.device = device\n self.discount = discount\n self.tau = tau\n self.max_action = max_action\n self.total_it = 0\n self.replay_buffer = replay_buffer\n # Adapt expectile to dataset type\n env_name = os.environ.get(\"ENV\", \"\")\n self.iql_tau = 0.9 if \"cloned\" in env_name else 0.7\n # CA-IQL specific hyperparameters\n self.beta_min = 1.0 # initial temperature (BC-like)\n self.beta_max = 5.0 # final temperature (selective)\n self.warmup_steps = 500000 # cosine annealing period\n self.exp_adv_max = 100.0 # clamp for exp weights\n # Networks\n self.actor = Actor(state_dim, action_dim, max_action).to(device)\n self.qf = Critic(state_dim, action_dim).to(device)\n self.qf_target = deepcopy(self.qf)\n self.qf_target.requires_grad_(False)\n self.vf = ValueFunction(state_dim).to(device)\n # Optimizers: weight decay on critics to prevent overfitting\n self.actor_optimizer = torch.optim.Adam(\n self.actor.parameters(), lr=actor_lr)\n self.actor_lr_schedule = torch.optim.lr_scheduler.CosineAnnealingLR(\n self.actor_optimizer, T_max=int(1e6))\n self.q_optimizer = torch.optim.Adam(\n self.qf.parameters(), lr=critic_lr, weight_decay=5e-5)\n self.v_optimizer = torch.optim.Adam(\n self.vf.parameters(), lr=critic_lr, weight_decay=5e-5)\n\n def _asymmetric_l2_loss(self, u: torch.Tensor, tau: float) -> torch.Tensor:\n return torch.mean(torch.abs(tau - (u < 0).float()) * u ** 2)\n\n def train(self, batch: TensorBatch) -> Dict[str, float]:\n self.total_it += 1\n states, actions, rewards, next_states, dones, *_ = batch\n rewards = rewards.squeeze(-1)\n dones = dones.squeeze(-1)\n log_dict: Dict[str, float] = {}\n # ── V update: expectile regression against Q_target ──\n with torch.no_grad():\n target_q = self.qf_target(states, actions)\n v = self.vf(states)\n v_loss = self._asymmetric_l2_loss(target_q - v, self.iql_tau)\n log_dict[\"value_loss\"] = v_loss.item()\n self.v_optimizer.zero_grad(set_to_none=True)\n v_loss.backward()\n self.v_optimizer.step()\n # ── Q update: Bellman with V(s') bootstrap ──\n with torch.no_grad():\n q_target = rewards + (1.0 - dones) * self.discount * 
self.vf(next_states)\n q1, q2 = self.qf.both(states, actions)\n q_loss = (F.mse_loss(q1, q_target) + F.mse_loss(q2, q_target)) / 2.0\n log_dict[\"critic_loss\"] = q_loss.item()\n self.q_optimizer.zero_grad(set_to_none=True)\n q_loss.backward()\n self.q_optimizer.step()\n soft_update(self.qf_target, self.qf, self.tau)\n # ── Actor: calibrated advantage-weighted regression ──\n with torch.no_grad():\n adv = target_q - self.vf(states)\n # Innovation 1: per-batch advantage whitening\n adv_whitened = (adv - adv.mean()) / adv.std().clamp(min=1e-3)\n # Innovation 2: cosine-annealed temperature\n t = min(self.total_it / self.warmup_steps, 1.0)\n beta = self.beta_min + 0.5 * (self.beta_max - self.beta_min) * (\n 1.0 - np.cos(np.pi * t))\n weights = torch.exp(beta * adv_whitened).clamp(max=self.exp_adv_max)\n actor_loss = -(self.actor.log_prob(states, actions) * weights).mean()\n log_dict[\"actor_loss\"] = actor_loss.item()\n log_dict[\"beta\"] = beta\n self.actor_optimizer.zero_grad(set_to_none=True)\n actor_loss.backward()\n torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 1.0)\n self.actor_optimizer.step()\n self.actor_lr_schedule.step()\n return log_dict\n"
7}
OK: Replaced lines 214..416 with 192 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–405.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–405 | total: 514 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {}
215
216
217 class DeterministicActor(nn.Module):
218 """Deterministic policy pi(s) = tanh(net(s)) * max_action."""
219 def __init__(self, state_dim: int, action_dim: int, max_action: float):
220 super().__init__()
221 self.max_action = max_action
222 self.net = nn.Sequential(
223 nn.Linear(state_dim, 256), nn.ReLU(),
224 nn.Linear(256, 256), nn.ReLU(),
225 nn.Linear(256, action_dim), nn.Tanh(),
226 )
227 def forward(self,
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
## Seed 42
### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 value_loss=2.1005 critic_loss=24.9245 actor_loss=-913.8812 beta=5.0000
TRAIN_METRICS step=997000 value_loss=9.0402 critic_loss=60.2768 actor_loss=-1419.5835 beta=5.0000
TRAIN_METRICS step=998000 value_loss=2.1686 critic_loss=11.8032 actor_loss=-720.5568 beta=5.0000
TRAIN_METRICS step=999000 value_loss=9.5232 critic_loss=31.2563 actor_loss=-664.7784 beta=5.0000
TRAIN_METRICS step=1000000 value_loss=25.5617 critic_loss=153.4018 actor_loss=-1773.1173 beta=5.0000
D4RL evaluation (pen-human-v1):
Evaluation over 10 episodes: 3127.496 , D4RL score: 101.700
Evaluation over 10 episodes: 3115.563 , D4RL score: 101.299
Evaluation over 10 episodes: 3116.616 , D4RL score: 101.335
Final D4RL score: 101.335
### hammer-human-v1 (scripts/hammer.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 value_loss=223.4192 criti
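For readability (the escaped payload above is hard to scan), here is a standalone sketch of the actor weighting from the first edit: advantages A = Q_target(s, a) - V(s) are whitened per batch, then exponentiated with a cosine-annealed temperature. The helper name caiql_weights is illustrative; the hyperparameter values are the ones in the payload.

```python
import numpy as np
import torch

def caiql_weights(adv: torch.Tensor, step: int,
                  beta_min: float = 1.0, beta_max: float = 5.0,
                  warmup_steps: int = 500_000, exp_adv_max: float = 100.0) -> torch.Tensor:
    # Per-batch whitening: makes the temperature insensitive to the Q/V scale of each env.
    adv = (adv - adv.mean()) / adv.std().clamp(min=1e-3)
    # Cosine annealing from near-uniform weighting (beta_min) to selective extraction (beta_max).
    t = min(step / warmup_steps, 1.0)
    beta = beta_min + 0.5 * (beta_max - beta_min) * (1.0 - np.cos(np.pi * t))
    return torch.exp(beta * adv).clamp(max=exp_adv_max)
```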
Tool Input
1{
2 "op": "replace",
3 "filename": "CORL/algorithms/offline/custom_adroit.py",
4 "start_line": 214,
5 "end_line": 405,
6 "content": "CONFIG_OVERRIDES: Dict[str, Any] = {}\n\n\nclass DeterministicActor(nn.Module):\n \"\"\"Deterministic policy pi(s) = tanh(net(s)) * max_action.\"\"\"\n def __init__(self, state_dim: int, action_dim: int, max_action: float):\n super().__init__()\n self.max_action = max_action\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, action_dim), nn.Tanh(),\n )\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.max_action * self.net(state)\n @torch.no_grad()\n def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n return self(state).cpu().data.numpy().flatten()\n\n\nclass Actor(nn.Module):\n \"\"\"Gaussian policy with dropout and state-independent log_std for IQL.\"\"\"\n def __init__(self, state_dim: int, action_dim: int, max_action: float,\n orthogonal_init: bool = False):\n super().__init__()\n self.max_action = max_action\n self.action_dim = action_dim\n self._mlp = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(), nn.Dropout(0.1),\n nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.1),\n nn.Linear(256, action_dim), nn.Tanh(),\n )\n self._log_std = nn.Parameter(torch.zeros(action_dim, dtype=torch.float32))\n self._min_log_std = -20.0\n self._max_log_std = 2.0\n def _get_policy(self, state: torch.Tensor):\n mean = self._mlp(state)\n log_std = self._log_std.clamp(self._min_log_std, self._max_log_std)\n return Normal(mean, log_std.exp())\n def log_prob(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:\n action = torch.clamp(action / self.max_action, -1.0 + 1e-6, 1.0 - 1e-6)\n return self._get_policy(state).log_prob(action).sum(-1)\n def forward(self, state: torch.Tensor, deterministic: bool = False):\n policy = self._get_policy(state)\n action = policy.mean if deterministic else policy.rsample()\n action = torch.clamp(action, -1.0, 1.0)\n return self.max_action * action, policy.log_prob(action).sum(-1)\n @torch.no_grad()\n def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n policy = self._get_policy(state)\n action = policy.sample() if self._mlp.training else policy.mean\n action = torch.clamp(self.max_action * action, -self.max_action, self.max_action)\n return action[0].cpu().numpy()\n\n\nclass Critic(nn.Module):\n \"\"\"Twin Q-function. Two 3x256 MLPs.\"\"\"\n def __init__(self, state_dim: int, action_dim: int, orthogonal_init: bool = False):\n super().__init__()\n self.q1 = nn.Sequential(\n nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n self.q2 = nn.Sequential(\n nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n def both(self, state: torch.Tensor, action: torch.Tensor):\n sa = torch.cat([state, action], dim=-1)\n return self.q1(sa).squeeze(-1), self.q2(sa).squeeze(-1)\n def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:\n return torch.min(*self.both(state, action))\n\n\nclass ValueFunction(nn.Module):\n \"\"\"V(s). 
3x256 MLP.\"\"\"\n def __init__(self, state_dim: int, orthogonal_init: bool = False):\n super().__init__()\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n init_module_weights(self.net, orthogonal_init)\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.net(state).squeeze(-1)\n\n\nclass OfflineAlgorithm:\n \"\"\"Scale-Adaptive IQL (SA-IQL): IQL with safe advantage normalization\n and progressive temperature scheduling.\n\n Innovations over standard IQL:\n 1. Safe Advantage Normalization: A_norm = (A - mean) / max(std, 1.0).\n When advantage spread is large (mixed data like cloned datasets),\n normalizes to prevent extreme weights. When advantage spread is\n small (narrow expert data), falls back to standard IQL behavior.\n This automatic scale adaptation eliminates manual beta tuning\n across environments with different reward/advantage magnitudes.\n 2. Progressive Temperature (BC-to-IQL Curriculum): beta anneals from\n ~0 (uniform weighting = BC) to beta_max via cosine schedule.\n Early training learns from all data equally (robust when Q is\n unreliable); later training selectively focuses on high-advantage\n actions. beta=0 with normalized advantages gives weight=1 for all\n actions, recovering pure behavioral cloning.\n 3. Gradient clipping on critics prevents Q-divergence on narrow\n datasets where bootstrap targets can be noisy.\n \"\"\"\n def __init__(self, state_dim: int, action_dim: int, max_action: float,\n replay_buffer=None, discount: float = 0.99, tau: float = 5e-3,\n actor_lr: float = 3e-4, critic_lr: float = 3e-4,\n alpha_lr: float = 3e-4, orthogonal_init: bool = True,\n device: str = \"cuda\"):\n self.device = device\n self.discount = discount\n self.tau = tau\n self.max_action = max_action\n self.total_it = 0\n self.replay_buffer = replay_buffer\n # Adapt expectile to dataset type\n env_name = os.environ.get(\"ENV\", \"\")\n self.iql_tau = 0.9 if \"cloned\" in env_name else 0.8\n # SA-IQL hyperparameters\n self.beta_min = 0.01 # ~BC at start\n self.beta_max = 3.0 # standard IQL temperature\n self.warmup_steps = 200000 # BC->IQL curriculum period\n self.exp_adv_max = 100.0 # clamp for exp weights\n self.adv_min_std = 1.0 # safe normalization threshold\n # Networks\n self.actor = Actor(state_dim, action_dim, max_action).to(device)\n self.qf = Critic(state_dim, action_dim).to(device)\n self.qf_target = deepcopy(self.qf)\n self.qf_target.requires_grad_(False)\n self.vf = ValueFunction(state_dim).to(device)\n # Optimizers\n self.actor_optimizer = torch.optim.Adam(\n self.actor.parameters(), lr=actor_lr)\n self.actor_lr_schedule = torch.optim.lr_scheduler.CosineAnnealingLR(\n self.actor_optimizer, T_max=int(1e6))\n self.q_optimizer = torch.optim.Adam(\n self.qf.parameters(), lr=critic_lr)\n self.v_optimizer = torch.optim.Adam(\n self.vf.parameters(), lr=critic_lr)\n\n def _asymmetric_l2_loss(self, u: torch.Tensor, tau: float) -> torch.Tensor:\n return torch.mean(torch.abs(tau - (u < 0).float()) * u ** 2)\n\n def train(self, batch: TensorBatch) -> Dict[str, float]:\n self.total_it += 1\n states, actions, rewards, next_states, dones, *_ = batch\n rewards = rewards.squeeze(-1)\n dones = dones.squeeze(-1)\n log_dict: Dict[str, float] = {}\n # ── V update: expectile regression against Q_target ──\n with torch.no_grad():\n target_q = self.qf_target(states, actions)\n v = self.vf(states)\n v_loss = self._asymmetric_l2_loss(target_q - v, self.iql_tau)\n 
log_dict[\"value_loss\"] = v_loss.item()\n self.v_optimizer.zero_grad(set_to_none=True)\n v_loss.backward()\n torch.nn.utils.clip_grad_norm_(self.vf.parameters(), 10.0)\n self.v_optimizer.step()\n # ── Q update: Bellman with V(s') bootstrap ──\n with torch.no_grad():\n q_target = rewards + (1.0 - dones) * self.discount * self.vf(next_states)\n q1, q2 = self.qf.both(states, actions)\n q_loss = (F.mse_loss(q1, q_target) + F.mse_loss(q2, q_target)) / 2.0\n log_dict[\"critic_loss\"] = q_loss.item()\n self.q_optimizer.zero_grad(set_to_none=True)\n q_loss.backward()\n torch.nn.utils.clip_grad_norm_(self.qf.parameters(), 10.0)\n self.q_optimizer.step()\n soft_update(self.qf_target, self.qf, self.tau)\n # ── Actor: scale-adaptive advantage-weighted regression ──\n with torch.no_grad():\n adv = target_q - self.vf(states)\n # Innovation 1: safe normalization (only scales down large advs)\n adv_safe = (adv - adv.mean()) / torch.clamp(adv.std(), min=self.adv_min_std)\n # Innovation 2: progressive temperature (BC->IQL curriculum)\n t = min(self.total_it / self.warmup_steps, 1.0)\n beta = self.beta_min + 0.5 * (self.beta_max - self.beta_min) * (\n 1.0 - np.cos(np.pi * t))\n weights = torch.exp(beta * adv_safe).clamp(max=self.exp_adv_max)\n actor_loss = -(self.actor.log_prob(states, actions) * weights).mean()\n log_dict[\"actor_loss\"] = actor_loss.item()\n log_dict[\"beta\"] = beta\n self.actor_optimizer.zero_grad(set_to_none=True)\n actor_loss.backward()\n torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 1.0)\n self.actor_optimizer.step()\n self.actor_lr_schedule.step()\n return log_dict\n"
7}
OK: Replaced lines 214..405 with 200 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–413.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–413 | total: 522 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {}
215
216
217 class DeterministicActor(nn.Module):
218 """Deterministic policy pi(s) = tanh(net(s)) * max_action."""
219 def __init__(self, state_dim: int, action_dim: int, max_action: float):
220 super().__init__()
221 self.max_action = max_action
222 self.net = nn.Sequential(
223 nn.Linear(state_dim, 256), nn.ReLU(),
224 nn.Linear(256, 256), nn.ReLU(),
225 nn.Linear(256, action_dim), nn.Tanh(),
226 )
227 def forward(self,
[Test #2] (1 test remaining; call submit(n=2) to submit this result)
### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 value_loss=8.6624 critic_loss=154.3203 actor_loss=-375.4684 beta=3.0000
TRAIN_METRICS step=997000 value_loss=37.6828 critic_loss=98.3834 actor_loss=-525.7670 beta=3.0000
TRAIN_METRICS step=998000 value_loss=13.1060 critic_loss=122.6612 actor_loss=-201.5057 beta=3.0000
TRAIN_METRICS step=999000 value_loss=16.6531 critic_loss=50.0746 actor_loss=-383.7168 beta=3.0000
TRAIN_METRICS step=1000000 value_loss=27.3323 critic_loss=122.6432 actor_loss=-330.8565 beta=3.0000
D4RL evaluation (pen-human-v1):
Evaluation over 10 episodes: 2938.018 , D4RL score: 95.343
Evaluation over 10 episodes: 2932.963 , D4RL score: 95.173
Evaluation over 10 episodes: 2927.962 , D4RL score: 95.005
Final D4RL score: 95.005
### hammer-human-v1 (scripts/hammer.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 value_loss=75.6034 critic_loss=590.2727
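Among other adjustments in the second edit (critic gradient clipping, expectile 0.8, no critic weight decay), the key changes to the actor weighting are the normalization floor (std is clamped at 1.0 rather than 1e-3, so narrow expert data with small advantage spread passes through near raw scale) and the temperature schedule (beta starts near 0, i.e. close to behavioral cloning, and anneals to 3.0 over 200k steps). A standalone sketch with the payload's values; the helper name saiql_weights is illustrative.

```python
import numpy as np
import torch

def saiql_weights(adv: torch.Tensor, step: int,
                  beta_min: float = 0.01, beta_max: float = 3.0,
                  warmup_steps: int = 200_000, adv_min_std: float = 1.0,
                  exp_adv_max: float = 100.0) -> torch.Tensor:
    # Safe normalization: only rescales when the advantage spread exceeds the floor.
    adv = (adv - adv.mean()) / torch.clamp(adv.std(), min=adv_min_std)
    # BC-to-IQL curriculum: near-uniform weights early, cosine-annealed to beta_max.
    t = min(step / warmup_steps, 1.0)
    beta = beta_min + 0.5 * (beta_max - beta_min) * (1.0 - np.cos(np.pi * t))
    return torch.exp(beta * adv).clamp(max=exp_adv_max)
```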
Tool Input
1{
2 "error": "Failed to parse tool arguments: Expecting ',' delimiter: line 1 column 73 (char 72)"
3}
Tool Input
1{
2 "n": 2
3}
[submit] Submitting result from test #2 as final.