Agent Conversation: gpt-5.4-pro

Model: gpt-5.4-pro
Total steps: 5
Messages: 12
User (Step 0)

Task: rl-offline-adroit

Offline RL: Dexterous Manipulation with Narrow Expert Data (Adroit)

Objective

Design and implement an offline RL algorithm for high-dimensional dexterous manipulation from narrow human demonstration data (~25 demos). Your code goes in custom_adroit.py. Three reference implementations (BC-10%, AWAC, ReBRAC) are provided as read-only.

Background

Adroit tasks control a 24-DoF robotic hand with high-dimensional action spaces (24-30 dims). The narrow human-v1 datasets contain only ~25 human teleoperation demonstrations, so the distribution shift between the learned policy and the data is far more severe than in locomotion tasks.

Constraints

  • Network dimensions are fixed at 256. All MLP hidden layers must use 256 units. A _mlp() factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256 (see the sketch after this list).
  • Total parameter count is enforced. The training loop checks that total trainable parameters do not exceed 1.2x the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity.
  • Do NOT simply copy a reference implementation with minor changes.
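
A minimal sketch of what these constraints imply. The real _mlp() lives in the FIXED section and its exact signature is not shown here, so this reconstruction is an assumption; count_trainable() and the example dimensions are likewise hypothetical, mirroring how a 1.2x parameter budget is typically checked.

```python
import torch.nn as nn


def _mlp(in_dim: int, out_dim: int, n_hidden: int = 2) -> nn.Sequential:
    # Assumed reconstruction of the provided factory: every hidden layer
    # is fixed at 256 units, as the constraint requires.
    layers, width = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(width, 256), nn.ReLU()]
        width = 256
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)


def count_trainable(module: nn.Module) -> int:
    # The training loop enforces a cap of 1.2x the largest baseline
    # architecture; this mirrors the usual way such a check is computed.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Illustrative usage (dimensions are placeholders, not the real env shapes):
net = _mlp(45, 24, n_hidden=3)
print(count_trainable(net))
```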

Evaluation

Policies are trained and evaluated on Pen (rotation), Door (opening), and Hammer (nailing) using the Adroit human-v1 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert).
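
For context, the normalized score is D4RL's affine rescaling of raw episode return, exposed as env.get_normalized_score; a minimal usage sketch (the raw return below is a made-up number):

```python
import gym
import d4rl  # noqa: F401 -- importing registers the Adroit human-v1 envs

env = gym.make("pen-human-v1")
raw_return = 2500.0  # hypothetical undiscounted episode return
# get_normalized_score maps a random policy's return to 0.0 and the expert
# reference return to 1.0; D4RL scores are conventionally reported x100.
score = 100.0 * env.get_normalized_score(raw_return)
print(f"D4RL normalized score: {score:.1f}")
```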

CORL/algorithms/offline/custom_adroit.py [EDITABLE — lines 214–416 only]

     1: # Custom offline RL algorithm for MLS-Bench — Adroit dexterous manipulation
     2: #
     3: # EDITABLE section: network definitions + OfflineAlgorithm class.
     4: # FIXED sections: everything else (config, utilities, data, eval, training loop).
     5: import os
     6: import random
     7: import uuid
     8: from copy import deepcopy
     9: from dataclasses import dataclass
    10: from typing import Any, Dict, List, Optional, Tuple, Union
    11: 
    12: import d4rl
    13: import gym
    14: import numpy as np
    15: import pyrallis
    16: import torch
    17: import torch.nn as nn
    18: import torch.nn.functional as F
    19: from torch.distributions import Normal, TanhTransform, TransformedDistribution
    20: 
    21: TensorBatch = List[torch.Tensor]
    22: 
    23: 
    24: # =====================================================================
    25: # FIXED: Configuration
    26: # batch_size, eval_freq, n_episodes, max_timesteps are enforced here.
    27: # =====================================================================
    28: @dataclass
    29: class TrainConfig:
    30:     device: str = "cuda"
    31:     env: str = "pen-human-v1"
    32:     seed: int = 0
    33:     eval_freq: int = int(5e3)
    34:     n_episodes: int = 10
    35:     max_timesteps: int = int(1e6)
    36:     checkpoints_path: Optional[str] = None
    37:     buffer_size: int 
Assistant (Step 1: edit)
Tool Input
{
  "op": "replace",
  "filename": "CORL/algorithms/offline/custom_adroit.py",
  "content": "CONFIG_OVERRIDES: Dict[str, Any] = {\"normalize\": True}\n\n\nclass DeterministicActor(nn.Module):\n    \"\"\"Deterministic policy pi(s) = tanh(net(s)) * max_action.\n    Suitable for BC, TD3+BC style algorithms. Default: 2 x 256 MLP.\"\"\"\n\n    def __init__(self, state_dim: int, action_dim: int, max_action: float):\n        super().__init__()\n        self.max_action = max_action\n        self.net = nn.Sequential(\n            nn.Linear(state_dim, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, action_dim), nn.Tanh(),\n        )\n\n    def forward(self, state: torch.Tensor) -> torch.Tensor:\n        return self.max_action * self.net(state)\n\n    @torch.no_grad()\n    def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n        state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n        return self(state).cpu().data.numpy().flatten()\n\n\nclass Actor(nn.Module):\n    \"\"\"Tanh-Gaussian stochastic policy.\n\n    We evaluate the mean action, but keep a stochastic parameterization so the\n    actor can optimize a weighted log-likelihood term like IQL/AWAC while also\n    supporting additional deterministic regularizers.\n    \"\"\"\n\n    def __init__(self, state_dim: int, action_dim: int, max_action: float,\n                 orthogonal_init: bool = False):\n        super().__init__()\n        self.max_action = max_action\n        self.action_dim = action_dim\n        self.net = nn.Sequential(\n            nn.Linear(state_dim, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 2 * action_dim),\n        )\n        init_module_weights(self.net, orthogonal_init)\n        self.log_std_min = -20.0\n        self.log_std_max = 2.0\n\n    def _get_dist(self, state: torch.Tensor):\n        out = self.net(state)\n        mean, log_std = torch.split(out, self.action_dim, dim=-1)\n        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)\n        return TransformedDistribution(\n            Normal(mean, torch.exp(log_std)), TanhTransform(cache_size=1)\n        ), mean\n\n    def forward(self, state: torch.Tensor, deterministic: bool = False):\n        dist, mean = self._get_dist(state)\n        action = torch.tanh(mean) if deterministic else dist.rsample()\n        log_prob = dist.log_prob(action).sum(-1)\n        return self.max_action * action, log_prob\n\n    def log_prob(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:\n        dist, _ = self._get_dist(state)\n        action = torch.clamp(action / self.max_action, -1.0 + 1e-6, 1.0 - 1e-6)\n        return dist.log_prob(action).sum(-1)\n\n    @torch.no_grad()\n    def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n        state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n        actions, _ = self(state, not self.training)\n        return actions.cpu().data.numpy().flatten()\n\n\nclass Critic(nn.Module):\n    \"\"\"Q-function Q(s, a). Default: 3 x 256 MLP.\"\"\"\n\n    def __init__(self, state_dim: int, action_dim: int, orthogonal_init: bool = False):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 1),\n        )\n        init_module_weights(self.net, orthogonal_init)\n\n    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:\n        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)\n\n\nclass ValueFunction(nn.Module):\n    \"\"\"State value function V(s). Default: 3 x 256 MLP.\"\"\"\n\n    def __init__(self, state_dim: int, orthogonal_init: bool = False):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(state_dim, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 1),\n        )\n        init_module_weights(self.net, orthogonal_init)\n\n    def forward(self, state: torch.Tensor) -> torch.Tensor:\n        return self.net(state).squeeze(-1)\n\n\nclass OfflineAlgorithm:\n    \"\"\"A-STAR IQL: anisotropic support-gated temporal advantage regularization.\n\n    Core idea for narrow Adroit demonstrations:\n    1) estimate support in *whitened* action space using dataset action std,\n       so deviations along low-variance joints are penalized more strongly;\n    2) use a support-gated Bellman backup that interpolates between the usual\n       optimistic value backup and a conservative dataset-next-action SARSA\n       backup when the policy drifts off the demo manifold;\n    3) add temporal action-delta consistency to the actor, leveraging the\n       dataset's sequential next-action field to preserve dexterous smoothness.\n    \"\"\"\n\n    def __init__(\n        self,\n        state_dim: int,\n        action_dim: int,\n        max_action: float,\n        replay_buffer: \"ReplayBuffer\" = None,\n        discount: float = 0.99,\n        tau: float = 5e-3,\n        actor_lr: float = 3e-4,\n        critic_lr: float = 3e-4,\n        alpha_lr: float = 3e-4,\n        orthogonal_init: bool = True,\n        device: str = \"cuda\",\n    ):\n        self.device = device\n        self.discount = discount\n        self.tau = tau\n        self.max_action = max_action\n        self.total_it = 0\n        self.replay_buffer = replay_buffer\n\n        # IQL-style backbone hyperparameters.\n        self.iql_tau = 0.8\n        self.beta = 3.0\n        self.exp_adv_max = 30.0\n\n        # A-STAR additions.\n        self.backup_gate_coef = 0.7\n        self.backup_gap_penalty = 0.05\n        self.conservative_coef = 0.10\n        self.delta_coef = 0.20\n        self.actor_q_coef = 0.05\n        self.min_scale = 0.10 * max_action\n\n        self.actor = Actor(state_dim, action_dim, max_action, orthogonal_init).to(device)\n        self.actor_target = deepcopy(self.actor).to(device)\n        self.critic_1 = Critic(state_dim, action_dim, orthogonal_init).to(device)\n        self.critic_2 = Critic(state_dim, action_dim, orthogonal_init).to(device)\n        self.critic_1_target = deepcopy(self.critic_1).to(device)\n        self.critic_2_target = deepcopy(self.critic_2).to(device)\n        self.value = ValueFunction(state_dim, orthogonal_init).to(device)\n\n        self.actor_target.requires_grad_(False)\n        self.critic_1_target.requires_grad_(False)\n        self.critic_2_target.requires_grad_(False)\n\n        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)\n        self.critic_1_optimizer = torch.optim.Adam(self.critic_1.parameters(), lr=critic_lr)\n        self.critic_2_optimizer = torch.optim.Adam(self.critic_2.parameters(), lr=critic_lr)\n        self.value_optimizer = torch.optim.Adam(self.value.parameters(), lr=critic_lr)\n\n        if replay_buffer is not None and replay_buffer._size > 0:\n            with torch.no_grad():\n                all_actions = replay_buffer._actions[: replay_buffer._size]\n                all_deltas = replay_buffer._next_actions[: replay_buffer._size] - all_actions\n                self.action_scale = all_actions.std(0).clamp_min(self.min_scale).to(device)\n                self.delta_scale = all_deltas.std(0).clamp_min(self.min_scale).to(device)\n        else:\n            self.action_scale = torch.full((action_dim,), self.min_scale, device=device)\n            self.delta_scale = torch.full((action_dim,), self.min_scale, device=device)\n\n    def _scaled_action_gap(self, pred: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:\n        return (((pred - ref) / (self.action_scale + 1e-6)) ** 2).mean(dim=-1)\n\n    def _scaled_delta_gap(self, pred_delta: torch.Tensor, ref_delta: torch.Tensor) -> torch.Tensor:\n        return (((pred_delta - ref_delta) / (self.delta_scale + 1e-6)) ** 2).mean(dim=-1)\n\n    def _target_q(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:\n        return torch.min(\n            self.critic_1_target(states, actions),\n            self.critic_2_target(states, actions),\n        )\n\n    def _current_q(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:\n        return torch.min(self.critic_1(states, actions), self.critic_2(states, actions))\n\n    def _asymmetric_l2_loss(self, u: torch.Tensor, tau: float) -> torch.Tensor:\n        return torch.mean(torch.abs(tau - (u < 0).float()) * u.pow(2))\n\n    def train(self, batch: TensorBatch) -> Dict[str, float]:\n        self.total_it += 1\n        states, actions, rewards, next_states, dones, next_actions = batch\n        rewards = rewards.squeeze(-1)\n        dones = dones.squeeze(-1)\n        not_done = 1.0 - dones\n        log_dict: Dict[str, float] = {}\n\n        # ── Value update: standard IQL expectile, but using twin target critics. ──\n        with torch.no_grad():\n            q_data_target = self._target_q(states, actions)\n        v = self.value(states)\n        value_loss = self._asymmetric_l2_loss(q_data_target - v, self.iql_tau)\n        self.value_optimizer.zero_grad()\n        value_loss.backward()\n        self.value_optimizer.step()\n\n        # ── Critic update: support-gated backup + gap-weighted conservative term. ──\n        with torch.no_grad():\n            next_pi_target, _ = self.actor_target(next_states, deterministic=True)\n            next_gap = self._scaled_action_gap(next_pi_target, next_actions)\n            support_gate = torch.exp(-self.backup_gate_coef * next_gap).clamp(min=0.05, max=1.0)\n\n            next_v = self.value(next_states)\n            next_q_data = self._target_q(next_states, next_actions)\n            mixed_next = support_gate * next_v + (1.0 - support_gate) * next_q_data\n            mixed_next = mixed_next - self.backup_gap_penalty * next_gap\n            q_target = rewards + not_done * self.discount * mixed_next\n\n        q1 = self.critic_1(states, actions)\n        q2 = self.critic_2(states, actions)\n        critic_td_loss = F.mse_loss(q1, q_target) + F.mse_loss(q2, q_target)\n\n        with torch.no_grad():\n            pi_for_critic, _\n... (truncated)",
  "after_line": 214,
  "start_line": 214,
  "end_line": 416
}
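
The load-bearing mechanism in this edit is the support-gated backup from the OfflineAlgorithm docstring. Distilled from the code above into a standalone sketch (the free-function packaging and tensor names are illustrative, not part of the file):

```python
import torch


def support_gated_backup(next_v, next_q_data, next_gap, rewards, not_done,
                         discount=0.99, gate_coef=0.7, gap_penalty=0.05):
    # next_gap: squared distance between the target policy's action at s'
    # and the dataset's next action, whitened by per-dimension action std.
    gate = torch.exp(-gate_coef * next_gap).clamp(min=0.05, max=1.0)
    # On-support (small gap): trust the optimistic V(s') backup.
    # Off-support (large gap): fall back to the conservative SARSA-style
    # backup through the dataset's own next action, Q(s', a'_data).
    mixed_next = gate * next_v + (1.0 - gate) * next_q_data
    mixed_next = mixed_next - gap_penalty * next_gap  # penalize drift
    return rewards + not_done * discount * mixed_next
```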
Tool Result (Step 1)
OK: Replaced lines 214..416 with 292 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–505.

[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–505 | total: 614 lines]
   211| # tau, discount.
   212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
   213| # =====================================================================
   214  CONFIG_OVERRIDES: Dict[str, Any] = {"normalize": True}
   215  
   216  
   217  class DeterministicActor(nn.Module):
   218      """Deterministic policy pi(s) = tanh(net(s)) * max_action.
   219      Suitable for BC, TD3+BC style algorithms. Default: 2 x 256 MLP."""
   220  
   221      def __init__(self, state_dim: int, action_dim: int, max_action: float):
   222          super().__init__()
   223          self.max_action = max_action
   224          self.net = nn.Sequential(
   225              nn.Linear(state_dim, 256), nn.ReLU(),
   226              nn.Linear(256, 256), nn.ReLU(),
   22
Assistant (Step 2: test)
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)


## Seed 42

### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=661000 value_loss=23.9836 critic_loss=3286.7000 critic_td_loss=3286.3081 conservative_loss=3.9174 actor_loss=-232.2850 bc_loss=-232.4141 delta_loss=0.8862 actor_q_loss=-0.9635 support_gate=0.9345 policy_gap=0.1494
TRAIN_METRICS step=662000 value_loss=18.6233 critic_loss=87082.5312 critic_td_loss=87081.2891 conservative_loss=12.4311 actor_loss=-227.5035 bc_loss=-227.5879 delta_loss=0.6654 actor_q_loss=-0.9725 support_gate=0.9473 policy_gap=0.1539
TRAIN_METRICS step=663000 value_loss=84.3826 critic_loss=2852.2166 critic_td_loss=2851.9712 conservative_loss=2.4534 actor_loss=-227.5135 bc_loss=-227.6848 delta_loss=1.0972 actor_q_loss=-0.9630 support_gate=0.9368 policy_gap=0.1916
TRAIN_METRICS step=664000 value_loss=109.5653 critic_loss=2816.9890 critic_td_loss=2816.1770 conservative_loss=8.1190 actor_loss=-232.2305 bc_loss=-2
Assistant (Step 3: edit)
Tool Input
{
  "op": "replace",
  "filename": "CORL/algorithms/offline/custom_adroit.py",
  "content": "CONFIG_OVERRIDES: Dict[str, Any] = {}\n\n\nclass DeterministicActor(nn.Module):\n    \"\"\"Deterministic policy pi(s) = tanh(net(s)) * max_action.\"\"\"\n\n    def __init__(self, state_dim: int, action_dim: int, max_action: float):\n        super().__init__()\n        self.max_action = max_action\n        self.net = nn.Sequential(\n            nn.Linear(state_dim, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, action_dim), nn.Tanh(),\n        )\n\n    def forward(self, state: torch.Tensor) -> torch.Tensor:\n        return self.max_action * self.net(state)\n\n    @torch.no_grad()\n    def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n        state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n        return self(state).cpu().data.numpy().flatten()\n\n\nclass Actor(nn.Module):\n    \"\"\"Gaussian policy with bounded mean and state-independent log_std.\n\n    This matches the robust IQL-style parameterization for Adroit, while still\n    letting us add extra actor-side regularizers beyond pure AWR.\n    \"\"\"\n\n    def __init__(self, state_dim: int, action_dim: int, max_action: float,\n                 orthogonal_init: bool = False):\n        super().__init__()\n        self.max_action = max_action\n        self.action_dim = action_dim\n        self._mlp = nn.Sequential(\n            nn.Linear(state_dim, 256), nn.ReLU(), nn.Dropout(0.1),\n            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.1),\n            nn.Linear(256, action_dim), nn.Tanh(),\n        )\n        self._log_std = nn.Parameter(torch.zeros(action_dim, dtype=torch.float32))\n        self._min_log_std = -20.0\n        self._max_log_std = 2.0\n\n    def _get_policy(self, state: torch.Tensor):\n        mean = self._mlp(state)\n        log_std = self._log_std.clamp(self._min_log_std, self._max_log_std)\n        return Normal(mean, log_std.exp())\n\n    def mean_action(self, state: torch.Tensor) -> torch.Tensor:\n        return self.max_action * self._mlp(state)\n\n    def log_prob(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:\n        action = torch.clamp(action / self.max_action, -1.0 + 1e-6, 1.0 - 1e-6)\n        policy = self._get_policy(state)\n        return policy.log_prob(action).sum(-1)\n\n    def forward(self, state: torch.Tensor, deterministic: bool = False):\n        policy = self._get_policy(state)\n        action = policy.mean if deterministic else policy.rsample()\n        action = torch.clamp(action, -1.0, 1.0)\n        log_prob = policy.log_prob(action).sum(-1)\n        return self.max_action * action, log_prob\n\n    @torch.no_grad()\n    def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n        state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n        policy = self._get_policy(state)\n        if self._mlp.training:\n            action = policy.sample()\n        else:\n            action = policy.mean\n        action = torch.clamp(self.max_action * action, -self.max_action, self.max_action)\n        return action[0].cpu().numpy()\n\n\nclass Critic(nn.Module):\n    \"\"\"Twin Q-function for IQL-style training.\"\"\"\n\n    def __init__(self, state_dim: int, action_dim: int, orthogonal_init: bool = False):\n        super().__init__()\n        self.q1 = nn.Sequential(\n            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 1),\n        )\n        self.q2 = nn.Sequential(\n            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 1),\n        )\n\n    def both(self, state: torch.Tensor, action: torch.Tensor):\n        sa = torch.cat([state, action], dim=-1)\n        return self.q1(sa).squeeze(-1), self.q2(sa).squeeze(-1)\n\n    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:\n        q1, q2 = self.both(state, action)\n        return torch.min(q1, q2)\n\n\nclass ValueFunction(nn.Module):\n    \"\"\"State value function V(s).\"\"\"\n\n    def __init__(self, state_dim: int, orthogonal_init: bool = False):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(state_dim, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 256), nn.ReLU(),\n            nn.Linear(256, 1),\n        )\n        init_module_weights(self.net, orthogonal_init)\n\n    def forward(self, state: torch.Tensor) -> torch.Tensor:\n        return self.net(state).squeeze(-1)\n\n\nclass OfflineAlgorithm:\n    \"\"\"TC-IQL: Trajectory-Consistent Implicit Q-Learning.\n\n    Novel contribution:\n    - keep IQL's strong expectile value learning backbone for narrow expert data;\n    - add a trajectory-consistency actor regularizer using dataset next_actions,\n      encouraging the policy's action delta mu(s')-mu(s) to match the demo\n      delta a'-a in a per-dimension whitened action space;\n    - add a support-gated Q-improvement term, so actor improvement via Q is only\n      trusted when the deterministic policy stays near the demonstration action.\n\n    This specifically targets Adroit's contact-rich dexterous control, where\n    temporal smoothness and staying on the demo manifold matter more than in\n    locomotion.\n    \"\"\"\n\n    def __init__(\n        self,\n        state_dim: int,\n        action_dim: int,\n        max_action: float,\n        replay_buffer: \"ReplayBuffer\" = None,\n        discount: float = 0.99,\n        tau: float = 5e-3,\n        actor_lr: float = 3e-4,\n        critic_lr: float = 3e-4,\n        alpha_lr: float = 3e-4,\n        orthogonal_init: bool = True,\n        device: str = \"cuda\",\n    ):\n        self.device = device\n        self.discount = discount\n        self.tau = tau\n        self.max_action = max_action\n        self.total_it = 0\n        self.replay_buffer = replay_buffer\n\n        # IQL backbone.\n        self.iql_tau = 0.8\n        self.beta = 3.0\n        self.exp_adv_max = 100.0\n\n        # TC-IQL additions.\n        self.delta_coef = 0.05\n        self.actor_q_coef = 0.01\n        self.support_temperature = 0.5\n        self.min_action_scale = 0.10 * max_action\n        self.min_delta_scale = 0.05 * max_action\n\n        self.actor = Actor(state_dim, action_dim, max_action, orthogonal_init).to(device)\n        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)\n        self.actor_lr_schedule = torch.optim.lr_scheduler.CosineAnnealingLR(\n            self.actor_optimizer, T_max=int(1e6)\n        )\n\n        self.qf = Critic(state_dim, action_dim, orthogonal_init).to(device)\n        self.qf_target = deepcopy(self.qf)\n        self.qf_target.requires_grad_(False)\n        self.q_optimizer = torch.optim.Adam(self.qf.parameters(), lr=critic_lr)\n\n        self.vf = ValueFunction(state_dim, orthogonal_init).to(device)\n        self.v_optimizer = torch.optim.Adam(self.vf.parameters(), lr=critic_lr)\n\n        if replay_buffer is not None and replay_buffer._size > 0:\n            with torch.no_grad():\n                all_actions = replay_buffer._actions[: replay_buffer._size].to(device)\n                all_deltas = (\n                    replay_buffer._next_actions[: replay_buffer._size]\n                    - replay_buffer._actions[: replay_buffer._size]\n                ).to(device)\n                self.action_scale = all_actions.std(0).clamp_min(self.min_action_scale)\n                self.delta_scale = all_deltas.std(0).clamp_min(self.min_delta_scale)\n        else:\n            self.action_scale = torch.full((action_dim,), self.min_action_scale, device=device)\n            self.delta_scale = torch.full((action_dim,), self.min_delta_scale, device=device)\n\n    def _asymmetric_l2_loss(self, u: torch.Tensor, tau: float) -> torch.Tensor:\n        return torch.mean(torch.abs(tau - (u < 0).float()) * u ** 2)\n\n    def _scaled_sq_error(\n        self, pred: torch.Tensor, target: torch.Tensor, scale: torch.Tensor\n    ) -> torch.Tensor:\n        return (((pred - target) / (scale + 1e-6)) ** 2).mean(dim=-1)\n\n    def train(self, batch: TensorBatch) -> Dict[str, float]:\n        self.total_it += 1\n        states, actions, rewards, next_states, dones, next_actions = batch\n        rewards = rewards.squeeze(-1)\n        dones = dones.squeeze(-1)\n        not_done = 1.0 - dones\n        log_dict: Dict[str, float] = {}\n\n        # ── Value update: standard IQL expectile regression. ──────────────────\n        with torch.no_grad():\n            target_q = self.qf_target(states, actions)\n        v = self.vf(states)\n        adv = target_q - v\n        v_loss = self._asymmetric_l2_loss(adv, self.iql_tau)\n\n        self.v_optimizer.zero_grad()\n        v_loss.backward()\n        self.v_optimizer.step()\n\n        # ── Critic update: standard IQL Bellman backup through V(s'). ────────\n        with torch.no_grad():\n            next_v = self.vf(next_states)\n            q_target = rewards + (1.0 - dones) * self.discount * next_v\n\n        q1, q2 = self.qf.both(states, actions)\n        q_loss = 0.5 * (F.mse_loss(q1, q_target) + F.mse_loss(q2, q_target))\n\n        self.q_optimizer.zero_grad()\n        q_loss.backward()\n        self.q_optimizer.step()\n        soft_update(self.qf_target, self.qf, self.tau)\n\n        # ── Actor update: AWR + trajectory consistency + support-gated Q. ────\n        with torch.no_grad():\n            q_data = self.qf_target(states, actions)\n            v_data = self.vf(states)\n            adv = q_data - v_data\n            exp_adv = torch.exp(self.beta * adv).clamp(max=self.exp_adv_max)\n\n            next_q_data = self.qf_target(next_states, next_actions)\n            next_v_data = self.vf(next_states)\n            next_adv = next_q_data - next_v_data\n            traj_w = torch.sqrt(\n                exp_adv * torch.exp(self.beta * next_adv).clamp(max=self.exp_adv_max)\n            )\n            traj_w = (traj_w / (traj_w.mean() + 1e-6)) * not_done\n            q_scale = q_data.abs().mean().clamp_min(1.0)\n\n        action_log_prob = self.actor.l\n... (truncated)",
  "after_line": 214,
  "start_line": 214,
  "end_line": 505
}
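
Two terms carry most of the weight in this revision: IQL's expectile (asymmetric L2) value loss, kept verbatim from the edit, and the new trajectory-consistency regularizer described in the TC-IQL docstring. A condensed standalone sketch (the function packaging is illustrative):

```python
import torch


def expectile_loss(u: torch.Tensor, tau: float = 0.8) -> torch.Tensor:
    # IQL's asymmetric L2 on u = Q(s, a) - V(s): residuals where Q exceeds V
    # are weighted by tau, the rest by (1 - tau), pushing V toward an upper
    # expectile of Q over dataset actions.
    return torch.mean(torch.abs(tau - (u < 0).float()) * u ** 2)


def delta_consistency_loss(mu_s, mu_next, a, a_next, delta_scale):
    # Match the policy's action delta mu(s') - mu(s) to the demonstrated
    # delta a' - a in a per-dimension whitened space (delta_scale is the
    # dataset's per-joint std of action deltas, floored to avoid blow-up).
    pred = (mu_next - mu_s) / (delta_scale + 1e-6)
    ref = (a_next - a) / (delta_scale + 1e-6)
    return ((pred - ref) ** 2).mean()
```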
Tool Result (Step 3)
OK: Replaced lines 214..505 with 282 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–495.

[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–495 | total: 604 lines]
   211| # tau, discount.
   212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
   213| # =====================================================================
   214  CONFIG_OVERRIDES: Dict[str, Any] = {}
   215  
   216  
   217  class DeterministicActor(nn.Module):
   218      """Deterministic policy pi(s) = tanh(net(s)) * max_action."""
   219  
   220      def __init__(self, state_dim: int, action_dim: int, max_action: float):
   221          super().__init__()
   222          self.max_action = max_action
   223          self.net = nn.Sequential(
   224              nn.Linear(state_dim, 256), nn.ReLU(),
   225              nn.Linear(256, 256), nn.ReLU(),
   226              nn.Linear(256, action_dim), nn.Tanh(),
   227          )
   228  
   229      
Assistant (Step 4: test)
Tool Result (Step 4)
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 value_loss=5.8980 critic_loss=59.1726 actor_loss=-320.6846 bc_loss=-320.6855 delta_loss=0.2159 actor_q_loss=-0.9942 adv_mean=-3.3603 support=0.9938
TRAIN_METRICS step=997000 value_loss=21.3534 critic_loss=147.4313 actor_loss=-580.1871 bc_loss=-580.1862 delta_loss=0.1813 actor_q_loss=-0.9974 adv_mean=-4.7429 support=0.9943
TRAIN_METRICS step=998000 value_loss=13.5331 critic_loss=30.7913 actor_loss=-209.2276 bc_loss=-209.2454 delta_loss=0.5560 actor_q_loss=-0.9949 adv_mean=-2.4684 support=0.9934
TRAIN_METRICS step=999000 value_loss=11.0342 critic_loss=50.9193 actor_loss=-857.1537 bc_loss=-857.1578 delta_loss=0.2792 actor_q_loss=-0.9930 adv_mean=-1.7107 support=0.9931
TRAIN_METRICS step=1000000 value_loss=4.7225 critic_loss=37.5525 actor_loss=-1007.7856 bc_loss=-1007.7932 delta_loss=0.3495 actor_q_loss=-0.9950 adv_mean=-0.4671 sup
Assistant (Step 5: submit)
Tool Input
{
  "n": 2
}
Tool Result (Step 5)
[submit] Submitting result from test #2 as final.

### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 value_loss=5.8980 critic_loss=59.1726 actor_loss=-320.6846 bc_loss=-320.6855 delta_loss=0.2159 actor_q_loss=-0.9942 adv_mean=-3.3603 support=0.9938
TRAIN_METRICS step=997000 value_loss=21.3534 critic_loss=147.4313 actor_loss=-580.1871 bc_loss=-580.1862 delta_loss=0.1813 actor_q_loss=-0.9974 adv_mean=-4.7429 support=0.9943
TRAIN_METRICS step=998000 value_loss=13.5331 critic_loss=30.7913 actor_loss=-209.2276 bc_loss=-209.2454 delta_loss=0.5560 actor_q_loss=-0.9949 adv_mean=-2.4684 support=0.9934
TRAIN_METRICS step=999000 value_loss=11.0342 critic_loss=50.9193 actor_loss=-857.1537 bc_loss=-857.1578 delta_loss=0.2792 actor_q_loss=-0.9930 adv_mean=-1.7107 support=0.9931
TRAIN_METRICS step=1000000 value_loss=4.7225 critic_loss=37.5525 actor_loss=-1007.7856 bc_loss=-1007.7932 delta_loss=0.3495 actor_q_loss=-0.9950 adv_mean=-0.4671 support=0.9944
D4RL ev