Agent Conversation: google/gemini-3.1-pro-preview
Task: rl-offline-adroit
Offline RL: Dexterous Manipulation with Narrow Expert Data (Adroit)
Objective
Design and implement an offline RL algorithm for high-dimensional dexterous manipulation from narrow human demonstration data (~25 demos). Your code goes in custom_adroit.py. Three reference implementations (BC-10%, AWAC, ReBRAC) are provided as read-only.
Background
Adroit tasks involve a 24-DoF robotic hand with high-dimensional action spaces (24-30 dims) and narrow human-v1 datasets containing only ~25 human teleoperation demonstrations, creating severe distribution shift compared to locomotion tasks.
Constraints
- Network dimensions are fixed at 256. All MLP hidden layers must use 256 units. A _mlp() factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256.
- Total parameter count is enforced. The training loop checks that total trainable parameters do not exceed 1.2x the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity. (A minimal sketch of both constraints follows this list.)
- Do NOT simply copy a reference implementation with minor changes
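For concreteness, here is that sketch. The real _mlp() lives in the FIXED section of the file and is not reproduced in this excerpt, so the signature below and the check_param_budget() helper are illustrative assumptions, not the enforced code:

import torch.nn as nn

def _mlp(in_dim: int, out_dim: int, n_hidden: int = 3) -> nn.Sequential:
    # Hypothetical stand-in for the provided factory: every hidden layer is 256 wide.
    layers, dim = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(dim, 256), nn.ReLU()]
        dim = 256
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

def check_param_budget(modules, largest_baseline_params: int) -> None:
    # Mirrors the stated rule: total trainable parameters <= 1.2x the largest baseline.
    total = sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)
    assert total <= 1.2 * largest_baseline_params, f"parameter budget exceeded: {total}"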
Evaluation
Trained and evaluated on Pen (rotation), Door (opening), Hammer (nailing) using Adroit human-v1 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert).
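For reference, the normalized score is a linear rescaling of the raw episode return between per-task random and expert reference returns that ship with d4rl. A small sketch of how a raw return maps to the reported number (the 3822.2 figure is only an example value, not a guaranteed result):

import gym
import d4rl  # noqa: F401 -- registers the Adroit human-v1 environments

env = gym.make("pen-human-v1")
raw_return = 3822.2  # undiscounted return of one evaluation episode (example value)
score = env.get_normalized_score(raw_return) * 100.0  # 0 = random policy, 100 = expert
print(f"D4RL normalized score: {score:.1f}")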
CORL/algorithms/offline/custom_adroit.py [EDITABLE — lines 214–416 only]
1: # Custom offline RL algorithm for MLS-Bench — Adroit dexterous manipulation
2: #
3: # EDITABLE section: network definitions + OfflineAlgorithm class.
4: # FIXED sections: everything else (config, utilities, data, eval, training loop).
5: import os
6: import random
7: import uuid
8: from copy import deepcopy
9: from dataclasses import dataclass
10: from typing import Any, Dict, List, Optional, Tuple, Union
11:
12: import d4rl
13: import gym
14: import numpy as np
15: import pyrallis
16: import torch
17: import torch.nn as nn
18: import torch.nn.functional as F
19: from torch.distributions import Normal, TanhTransform, TransformedDistribution
20:
21: TensorBatch = List[torch.Tensor]
22:
23:
24: # =====================================================================
25: # FIXED: Configuration
26: # batch_size, eval_freq, n_episodes, max_timesteps are enforced here.
27: # =====================================================================
28: @dataclass
29: class TrainConfig:
30: device: str = "cuda"
31: env: str = "pen-human-v1"
32: seed: int = 0
33: eval_freq: int = int(5e3)
34: n_episodes: int = 10
35: max_timesteps: int = int(1e6)
36: checkpoints_path: Optional[str] = None
37: buffer_size: int
Tool Input
1{
2 "op": "replace",
3 "start_line": 214,
4 "end_line": 416,
5 "filename": "CORL/algorithms/offline/custom_adroit.py",
6 "content": "CONFIG_OVERRIDES: Dict[str, Any] = {\n \"normalize\": True,\n \"actor_lr\": 3e-4,\n \"critic_lr\": 3e-4,\n}\n\n\nclass DeterministicActor(nn.Module):\n def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):\n super().__init__()\n self.max_action = max_action\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, action_dim), nn.Tanh(),\n )\n init_module_weights(self.net, orthogonal_init)\n\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.max_action * self.net(state)\n\n @torch.no_grad()\n def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n return self(state).cpu().data.numpy().flatten()\n\n\nclass CriticTwin(nn.Module):\n def __init__(self, state_dim: int, action_dim: int, orthogonal_init: bool = True):\n super().__init__()\n self.q1 = nn.Sequential(\n nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n self.q2 = nn.Sequential(\n nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n init_module_weights(self.q1, orthogonal_init)\n init_module_weights(self.q2, orthogonal_init)\n\n def both(self, state: torch.Tensor, action: torch.Tensor):\n sa = torch.cat([state, action], dim=-1)\n return self.q1(sa).squeeze(-1), self.q2(sa).squeeze(-1)\n\n\nclass ValueFunction(nn.Module):\n def __init__(self, state_dim: int, orthogonal_init: bool = True):\n super().__init__()\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n init_module_weights(self.net, orthogonal_init)\n\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.net(state).squeeze(-1)\n\n\nclass OfflineAlgorithm:\n \"\"\"Manifold-Augmented Deterministic Behavior Cloning (MA-DBC).\n \n Novelties:\n 1. Advantage-weighted Deterministic policy extraction (avoids variance collapse).\n 2. Local State-Manifold Augmentation (improves OOD robustness for narrow data).\n 3. 
Batch-normalized advantage scaling and pessimistic Expectile Q-learning natively smoothly combined.\n \"\"\"\n def __init__(\n self, state_dim: int, action_dim: int, max_action: float, replay_buffer=None,\n discount: float = 0.99, tau: float = 5e-3, actor_lr: float = 3e-4, \n critic_lr: float = 3e-4, alpha_lr: float = 3e-4, orthogonal_init: bool = True, device: str = \"cuda\"\n ):\n self.device = device\n self.discount = discount\n self.tau = tau\n self.max_action = max_action\n self.total_it = 0\n \n # Algorithmic hyperparameters\n self.iql_tau = 0.8\n self.beta = 5.0\n self.state_noise = 0.05\n self.action_noise = 0.01\n\n self.actor = DeterministicActor(state_dim, action_dim, max_action, orthogonal_init).to(device)\n self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr, weight_decay=1e-4)\n\n self.critic = CriticTwin(state_dim, action_dim, orthogonal_init).to(device)\n self.critic_target = deepcopy(self.critic)\n self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)\n\n self.vf = ValueFunction(state_dim, orthogonal_init).to(device)\n self.v_optimizer = torch.optim.Adam(self.vf.parameters(), lr=critic_lr)\n\n def _expectile_loss(self, diff, tau):\n weight = torch.where(diff > 0, tau, (1 - tau))\n return torch.mean(weight * (diff**2))\n\n def train(self, batch) -> Dict[str, float]:\n self.total_it += 1\n states, actions, rewards, next_states, dones, _ = batch\n rewards = rewards.squeeze(-1)\n dones = dones.squeeze(-1)\n\n # 1. State-Manifold Augmentation\n s_noise = torch.randn_like(states) * self.state_noise\n aug_states = states + s_noise\n a_noise = (torch.randn_like(actions) * self.action_noise).clamp(-self.max_action*0.1, self.max_action*0.1)\n aug_actions = (actions + a_noise).clamp(-self.max_action, self.max_action)\n\n cat_states = torch.cat([states, aug_states], dim=0)\n cat_actions = torch.cat([actions, aug_actions], dim=0)\n cat_next_states = torch.cat([next_states, next_states], dim=0)\n cat_rewards = torch.cat([rewards, rewards], dim=0)\n cat_dones = torch.cat([dones, dones], dim=0)\n\n # 2. Implicit Value Function Update\n with torch.no_grad():\n q1, q2 = self.critic_target.both(cat_states, cat_actions)\n q_tgt = torch.min(q1, q2)\n\n v = self.vf(cat_states)\n v_loss = self._expectile_loss(q_tgt - v, self.iql_tau)\n \n self.v_optimizer.zero_grad()\n v_loss.backward()\n self.v_optimizer.step()\n\n # 3. Critic Twin Update\n with torch.no_grad():\n next_v = self.vf(cat_next_states)\n q_target = cat_rewards + self.discount * (1.0 - cat_dones) * next_v\n\n q1_curr, q2_curr = self.critic.both(cat_states, cat_actions)\n q_loss = F.mse_loss(q1_curr, q_target) + F.mse_loss(q2_curr, q_target)\n\n self.critic_optimizer.zero_grad()\n q_loss.backward()\n self.critic_optimizer.step()\n\n # 4. 
Advantage-Weighted Deterministic Policy Extraction\n with torch.no_grad():\n q1_clean, q2_clean = self.critic_target.both(states, actions)\n q_act = torch.min(q1_clean, q2_clean)\n v_act = self.vf(states)\n adv = q_act - v_act\n # Batch-norm advantage for robust extraction\n adv = (adv - adv.mean()) / (adv.std() + 1e-8)\n weights = torch.exp(self.beta * adv).clamp(max=100.0)\n\n act_pred = self.actor(states)\n act_pred_aug = self.actor(aug_states)\n\n mse = F.mse_loss(act_pred, actions, reduction='none').mean(dim=1)\n mse_aug = F.mse_loss(act_pred_aug, actions, reduction='none').mean(dim=1)\n\n actor_loss = (weights * (mse + mse_aug) * 0.5).mean()\n\n self.actor_optimizer.zero_grad()\n actor_loss.backward()\n self.actor_optimizer.step()\n\n soft_update(self.critic_target, self.critic, self.tau)\n\n return {\n \"actor_loss\": actor_loss.item(),\n \"critic_loss\": q_loss.item(),\n \"value_loss\": v_loss.item()\n }\n",
7 "after_line": 213
8}OK: Replaced lines 214..416 with 176 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–389.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–389 | total: 498 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU()
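To summarize the algorithm introduced in the replace above: with value network V_psi, twin critics Q_theta (target Q_theta-bar), and deterministic actor pi_phi, the three losses it optimizes are, in my own notation (the augmented-state BC term and the per-batch advantage normalization in the code are omitted here for brevity):

\mathcal{L}_V(\psi) = \mathbb{E}_{(s,a)}\big[\,\big|\tau - \mathbf{1}\{Q_{\bar\theta}(s,a) < V_\psi(s)\}\big|\,\big(Q_{\bar\theta}(s,a) - V_\psi(s)\big)^2\big], \quad \tau = 0.8
\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,r,s',d)}\big[\big(Q_\theta(s,a) - r - \gamma(1-d)\,V_\psi(s')\big)^2\big], \quad \gamma = 0.99
\mathcal{L}_\pi(\phi) = \mathbb{E}_{(s,a)}\big[\min\big(e^{\beta(Q_{\bar\theta}(s,a) - V_\psi(s))},\,100\big)\,\lVert \pi_\phi(s) - a\rVert_2^2\big], \quad \beta = 5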
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
## Seed 42
### pen-human-v1 (scripts/pen.sh)
WARNING: Overriding HOME environment variable with APPTAINERENV_HOME is not permitted
/usr/bin/bash: warning: setlocale: LC_ALL: cannot change locale (zh_CN.UTF-8)
pybullet build time: Jan 29 2025 23:19:57
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality. Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade. See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Warning: Flow failed to import. Set the environment variable D4RL_SUPPRESS_IMPORT_ERROR=1 to suppress this message. No module named 'flow'
Warning: CARLA failed to import. Set the environment variable D4RL_SUPPRESS_IMPORT_ERROR=1 to suppress this message. No module named 'carla'
/usr/local/lib/python3.8/dist-pac
Tool Input
1{
2 "end_line": 239,
3 "content": "class DeterministicActor(nn.Module):\n def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):\n super().__init__()\n self.max_action = max_action\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, action_dim),\n )\n init_module_weights(self.net, orthogonal_init)\n\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.max_action * torch.tanh(self.net(state))\n\n @torch.no_grad()\n def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n return self(state).cpu().data.numpy().flatten()",
4 "after_line": 220,
5 "start_line": 221,
6 "op": "replace",
7 "filename": "CORL/algorithms/offline/custom_adroit.py"
8}OK: Replaced lines 221..239 with 19 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–389.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–389 | total: 498 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
[Test #2] (1 test remaining; call submit(n=2) to submit this result)
### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 actor_loss=0.0091 critic_loss=168.1975 value_loss=16.6622
TRAIN_METRICS step=997000 actor_loss=0.0065 critic_loss=178.6120 value_loss=14.3989
TRAIN_METRICS step=998000 actor_loss=0.0051 critic_loss=913.4302 value_loss=36.4418
TRAIN_METRICS step=999000 actor_loss=0.0063 critic_loss=161.8335 value_loss=21.5565
TRAIN_METRICS step=1000000 actor_loss=0.0080 critic_loss=117.1778 value_loss=11.2903
D4RL evaluation (pen-human-v1):
Evaluation over 10 episodes: 3822.210 , D4RL score: 125.008
Evaluation over 10 episodes: 3903.808 , D4RL score: 127.746
Evaluation over 10 episodes: 3925.793 , D4RL score: 128.483
Final D4RL score: 128.483
### hammer-human-v1 (scripts/hammer.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 actor_loss=0.0769 critic_loss=424.3253 value_loss=48.7755
TRAIN_METRICS step=997000 actor_loss=0.0356 critic
Tool Input
1{
2 "end_line": 368,
3 "filename": "CORL/algorithms/offline/custom_adroit.py",
4 "start_line": 366,
5 "op": "replace",
6 "content": " # adv = (adv - adv.mean()) / (adv.std() + 1e-8) # REMOVED: destroys global value scale on cloned datasets!",
7 "after_line": 365
8}OK: Replaced lines 366..368 with 1 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–387.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–387 | total: 496 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
Tool Input
1{
2 "filename": "CORL/algorithms/offline/custom_adroit.py",
3 "after_line": 365,
4 "end_line": 365,
5 "start_line": 365,
6 "content": " adv = q_act - v_act",
7 "op": "replace"
8}OK: Replaced lines 365..365 with 1 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–387.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–387 | total: 496 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
Tool Input
1{
2 "content": " v_act = self.vf(states)",
3 "filename": "CORL/algorithms/offline/custom_adroit.py",
4 "end_line": 365,
5 "start_line": 365,
6 "op": "insert",
7 "after_line": 365
8}OK: Inserted 1 line(s) after line 365 in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–388.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–388 | total: 497 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
Tool Input
1{
2 "end_line": 366,
3 "op": "replace",
4 "filename": "CORL/algorithms/offline/custom_adroit.py",
5 "after_line": 365,
6 "content": " v_act = self.vf(states)\n adv = q_act - v_act",
7 "start_line": 365
8}OK: Replaced lines 365..366 with 2 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–388.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–388 | total: 497 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
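The net effect of the last three edits is that the per-batch advantage normalization is gone while the v_act / adv lines that the first replace accidentally dropped are restored. A toy sketch of why the raw advantage scale matters for the exp(beta * adv) weights on a narrow, near-expert dataset (the numbers are made up for illustration, not taken from the run):

import torch

beta = 5.0
# On near-expert human demos, most advantages are small and nearly tied.
adv = torch.tensor([0.02, 0.01, -0.01, 0.03])

raw_w = torch.exp(beta * adv).clamp(max=100.0)
# tensor([1.1052, 1.0513, 0.9512, 1.1618]) -> weights stay close to uniform, i.e. near plain BC

norm = (adv - adv.mean()) / (adv.std() + 1e-8)
norm_w = torch.exp(beta * norm).clamp(max=100.0)
# Batch normalization rescales nearly-tied advantages to unit variance, so the
# exponential weights spread over orders of magnitude and amplify estimation noise.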
[Test #3] (0 tests remaining)
## Seed 42
### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 actor_loss=0.0059 critic_loss=168.1975 value_loss=16.6622
TRAIN_METRICS step=997000 actor_loss=0.0113 critic_loss=178.6120 value_loss=14.3989
TRAIN_METRICS step=998000 actor_loss=0.0079 critic_loss=913.4302 value_loss=36.4418
TRAIN_METRICS step=999000 actor_loss=0.0018 critic_loss=161.8335 value_loss=21.5565
TRAIN_METRICS step=1000000 actor_loss=0.0097 critic_loss=117.1778 value_loss=11.2903
D4RL evaluation (pen-human-v1):
Evaluation over 10 episodes: 3048.116 , D4RL score: 99.037
Evaluation over 10 episodes: 3121.335 , D4RL score: 101.493
Evaluation over 10 episodes: 3274.281 , D4RL score: 106.625
Final D4RL score: 106.625
### hammer-human-v1 (scripts/hammer.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 actor_loss=0.1469 critic_loss=424.3253 value_loss=48.7755
TRAIN_METRICS step=997000 actor_loss=0.1669 critic_loss=265.1998 value_loss=4