Agent Conversation: google/gemini-3.1-pro-preview
Task: rl-offline-adroit
Offline RL: Dexterous Manipulation with Narrow Expert Data (Adroit)
Objective
Design and implement an offline RL algorithm for high-dimensional dexterous manipulation from narrow human demonstration data (~25 demos). Your code goes in custom_adroit.py. Three reference implementations (BC-10%, AWAC, ReBRAC) are provided as read-only.
Background
Adroit tasks involve a 24-DoF robotic hand with high-dimensional action spaces (24-30 dims) and narrow human-v1 datasets containing only ~25 human teleoperation demonstrations, creating severe distribution shift compared to locomotion tasks.
Constraints
- Network dimensions are fixed at 256. All MLP hidden layers must use 256 units. A _mlp() factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256.
- Total parameter count is enforced. The training loop checks that total trainable parameters do not exceed 1.2x the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity. (A minimal sketch of both constraints follows this list.)
- Do NOT simply copy a reference implementation with minor changes
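For concreteness, here is that sketch. The real _mlp() lives in the FIXED section of the file and is not reproduced in this excerpt, so the signature below and the check_param_budget() helper are illustrative assumptions, not the enforced code:

import torch.nn as nn

def _mlp(in_dim: int, out_dim: int, n_hidden: int = 3) -> nn.Sequential:
    # Hypothetical stand-in for the provided factory: every hidden layer is 256 wide.
    layers, dim = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(dim, 256), nn.ReLU()]
        dim = 256
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

def check_param_budget(modules, largest_baseline_params: int) -> None:
    # Mirrors the stated rule: total trainable parameters <= 1.2x the largest baseline.
    total = sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)
    assert total <= 1.2 * largest_baseline_params, f"parameter budget exceeded: {total}"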
Evaluation
Trained and evaluated on Pen (rotation), Door (opening), Hammer (nailing) using Adroit human-v1 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert).
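For reference, the normalized score is a linear rescaling of the raw episode return between per-task random and expert reference returns that ship with d4rl. A small sketch of how a raw return maps to the reported number (the 3822.2 figure is only an example value, not a guaranteed result):

import gym
import d4rl  # noqa: F401 -- registers the Adroit human-v1 environments

env = gym.make("pen-human-v1")
raw_return = 3822.2  # undiscounted return of one evaluation episode (example value)
score = env.get_normalized_score(raw_return) * 100.0  # 0 = random policy, 100 = expert
print(f"D4RL normalized score: {score:.1f}")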
CORL/algorithms/offline/custom_adroit.py [EDITABLE — lines 214–416 only]
1: # Custom offline RL algorithm for MLS-Bench — Adroit dexterous manipulation
2: #
3: # EDITABLE section: network definitions + OfflineAlgorithm class.
4: # FIXED sections: everything else (config, utilities, data, eval, training loop).
5: import os
6: import random
7: import uuid
8: from copy import deepcopy
9: from dataclasses import dataclass
10: from typing import Any, Dict, List, Optional, Tuple, Union
11:
12: import d4rl
13: import gym
14: import numpy as np
15: import pyrallis
16: import torch
17: import torch.nn as nn
18: import torch.nn.functional as F
19: from torch.distributions import Normal, TanhTransform, TransformedDistribution
20:
21: TensorBatch = List[torch.Tensor]
22:
23:
24: # =====================================================================
25: # FIXED: Configuration
26: # batch_size, eval_freq, n_episodes, max_timesteps are enforced here.
27: # =====================================================================
28: @dataclass
29: class TrainConfig:
30: device: str = "cuda"
31: env: str = "pen-human-v1"
32: seed: int = 0
33: eval_freq: int = int(5e3)
34: n_episodes: int = 10
35: max_timesteps: int = int(1e6)
36: checkpoints_path: Optional[str] = None
37: buffer_size: int
Tool Input
1{
2 "op": "replace",
3 "start_line": 214,
4 "end_line": 416,
5 "filename": "CORL/algorithms/offline/custom_adroit.py",
6 "content": "CONFIG_OVERRIDES: Dict[str, Any] = {\n \"normalize\": True,\n \"actor_lr\": 3e-4,\n \"critic_lr\": 3e-4,\n}\n\n\nclass DeterministicActor(nn.Module):\n def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):\n super().__init__()\n self.max_action = max_action\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, action_dim), nn.Tanh(),\n )\n init_module_weights(self.net, orthogonal_init)\n\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.max_action * self.net(state)\n\n @torch.no_grad()\n def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n return self(state).cpu().data.numpy().flatten()\n\n\nclass CriticTwin(nn.Module):\n def __init__(self, state_dim: int, action_dim: int, orthogonal_init: bool = True):\n super().__init__()\n self.q1 = nn.Sequential(\n nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n self.q2 = nn.Sequential(\n nn.Linear(state_dim + action_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n init_module_weights(self.q1, orthogonal_init)\n init_module_weights(self.q2, orthogonal_init)\n\n def both(self, state: torch.Tensor, action: torch.Tensor):\n sa = torch.cat([state, action], dim=-1)\n return self.q1(sa).squeeze(-1), self.q2(sa).squeeze(-1)\n\n\nclass ValueFunction(nn.Module):\n def __init__(self, state_dim: int, orthogonal_init: bool = True):\n super().__init__()\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 1),\n )\n init_module_weights(self.net, orthogonal_init)\n\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.net(state).squeeze(-1)\n\n\nclass OfflineAlgorithm:\n \"\"\"Manifold-Augmented Deterministic Behavior Cloning (MA-DBC).\n \n Novelties:\n 1. Advantage-weighted Deterministic policy extraction (avoids variance collapse).\n 2. Local State-Manifold Augmentation (improves OOD robustness for narrow data).\n 3. 
Batch-normalized advantage scaling and pessimistic Expectile Q-learning natively smoothly combined.\n \"\"\"\n def __init__(\n self, state_dim: int, action_dim: int, max_action: float, replay_buffer=None,\n discount: float = 0.99, tau: float = 5e-3, actor_lr: float = 3e-4, \n critic_lr: float = 3e-4, alpha_lr: float = 3e-4, orthogonal_init: bool = True, device: str = \"cuda\"\n ):\n self.device = device\n self.discount = discount\n self.tau = tau\n self.max_action = max_action\n self.total_it = 0\n \n # Algorithmic hyperparameters\n self.iql_tau = 0.8\n self.beta = 5.0\n self.state_noise = 0.05\n self.action_noise = 0.01\n\n self.actor = DeterministicActor(state_dim, action_dim, max_action, orthogonal_init).to(device)\n self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr, weight_decay=1e-4)\n\n self.critic = CriticTwin(state_dim, action_dim, orthogonal_init).to(device)\n self.critic_target = deepcopy(self.critic)\n self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)\n\n self.vf = ValueFunction(state_dim, orthogonal_init).to(device)\n self.v_optimizer = torch.optim.Adam(self.vf.parameters(), lr=critic_lr)\n\n def _expectile_loss(self, diff, tau):\n weight = torch.where(diff > 0, tau, (1 - tau))\n return torch.mean(weight * (diff**2))\n\n def train(self, batch) -> Dict[str, float]:\n self.total_it += 1\n states, actions, rewards, next_states, dones, _ = batch\n rewards = rewards.squeeze(-1)\n dones = dones.squeeze(-1)\n\n # 1. State-Manifold Augmentation\n s_noise = torch.randn_like(states) * self.state_noise\n aug_states = states + s_noise\n a_noise = (torch.randn_like(actions) * self.action_noise).clamp(-self.max_action*0.1, self.max_action*0.1)\n aug_actions = (actions + a_noise).clamp(-self.max_action, self.max_action)\n\n cat_states = torch.cat([states, aug_states], dim=0)\n cat_actions = torch.cat([actions, aug_actions], dim=0)\n cat_next_states = torch.cat([next_states, next_states], dim=0)\n cat_rewards = torch.cat([rewards, rewards], dim=0)\n cat_dones = torch.cat([dones, dones], dim=0)\n\n # 2. Implicit Value Function Update\n with torch.no_grad():\n q1, q2 = self.critic_target.both(cat_states, cat_actions)\n q_tgt = torch.min(q1, q2)\n\n v = self.vf(cat_states)\n v_loss = self._expectile_loss(q_tgt - v, self.iql_tau)\n \n self.v_optimizer.zero_grad()\n v_loss.backward()\n self.v_optimizer.step()\n\n # 3. Critic Twin Update\n with torch.no_grad():\n next_v = self.vf(cat_next_states)\n q_target = cat_rewards + self.discount * (1.0 - cat_dones) * next_v\n\n q1_curr, q2_curr = self.critic.both(cat_states, cat_actions)\n q_loss = F.mse_loss(q1_curr, q_target) + F.mse_loss(q2_curr, q_target)\n\n self.critic_optimizer.zero_grad()\n q_loss.backward()\n self.critic_optimizer.step()\n\n # 4. 
Advantage-Weighted Deterministic Policy Extraction\n with torch.no_grad():\n q1_clean, q2_clean = self.critic_target.both(states, actions)\n q_act = torch.min(q1_clean, q2_clean)\n v_act = self.vf(states)\n adv = q_act - v_act\n # Batch-norm advantage for robust extraction\n adv = (adv - adv.mean()) / (adv.std() + 1e-8)\n weights = torch.exp(self.beta * adv).clamp(max=100.0)\n\n act_pred = self.actor(states)\n act_pred_aug = self.actor(aug_states)\n\n mse = F.mse_loss(act_pred, actions, reduction='none').mean(dim=1)\n mse_aug = F.mse_loss(act_pred_aug, actions, reduction='none').mean(dim=1)\n\n actor_loss = (weights * (mse + mse_aug) * 0.5).mean()\n\n self.actor_optimizer.zero_grad()\n actor_loss.backward()\n self.actor_optimizer.step()\n\n soft_update(self.critic_target, self.critic, self.tau)\n\n return {\n \"actor_loss\": actor_loss.item(),\n \"critic_loss\": q_loss.item(),\n \"value_loss\": v_loss.item()\n }\n",
7 "after_line": 213
8}OK: Replaced lines 214..416 with 176 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–389.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–389 | total: 498 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU()
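To summarize the algorithm introduced in the replace above: with value network V_psi, twin critics Q_theta (target Q_theta-bar), and deterministic actor pi_phi, the three losses it optimizes are, in my own notation (the augmented-state BC term and the per-batch advantage normalization in the code are omitted here for brevity):

\mathcal{L}_V(\psi) = \mathbb{E}_{(s,a)}\big[\,\big|\tau - \mathbf{1}\{Q_{\bar\theta}(s,a) < V_\psi(s)\}\big|\,\big(Q_{\bar\theta}(s,a) - V_\psi(s)\big)^2\big], \quad \tau = 0.8
\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,r,s',d)}\big[\big(Q_\theta(s,a) - r - \gamma(1-d)\,V_\psi(s')\big)^2\big], \quad \gamma = 0.99
\mathcal{L}_\pi(\phi) = \mathbb{E}_{(s,a)}\big[\min\big(e^{\beta(Q_{\bar\theta}(s,a) - V_\psi(s))},\,100\big)\,\lVert \pi_\phi(s) - a\rVert_2^2\big], \quad \beta = 5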
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)
## Seed 42
### pen-human-v1 (scripts/pen.sh)
WARNING: Overriding HOME environment variable with APPTAINERENV_HOME is not permitted
/usr/bin/bash: warning: setlocale: LC_ALL: cannot change locale (zh_CN.UTF-8)
pybullet build time: Jan 29 2025 23:19:57
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality. Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade. See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Warning: Flow failed to import. Set the environment variable D4RL_SUPPRESS_IMPORT_ERROR=1 to suppress this message. No module named 'flow'
Warning: CARLA failed to import. Set the environment variable D4RL_SUPPRESS_IMPORT_ERROR=1 to suppress this message. No module named 'carla'
/usr/local/lib/python3.8/dist-pac
Tool Input
1{
2 "end_line": 239,
3 "content": "class DeterministicActor(nn.Module):\n def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):\n super().__init__()\n self.max_action = max_action\n self.net = nn.Sequential(\n nn.Linear(state_dim, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, 256), nn.ReLU(),\n nn.Linear(256, action_dim),\n )\n init_module_weights(self.net, orthogonal_init)\n\n def forward(self, state: torch.Tensor) -> torch.Tensor:\n return self.max_action * torch.tanh(self.net(state))\n\n @torch.no_grad()\n def act(self, state: np.ndarray, device: str = \"cpu\") -> np.ndarray:\n state = torch.tensor(state.reshape(1, -1), device=device, dtype=torch.float32)\n return self(state).cpu().data.numpy().flatten()",
4 "after_line": 220,
5 "start_line": 221,
6 "op": "replace",
7 "filename": "CORL/algorithms/offline/custom_adroit.py"
8}OK: Replaced lines 221..239 with 19 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–389.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–389 | total: 498 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
[Test #2] (1 test remaining; call submit(n=2) to submit this result)
### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 actor_loss=0.0091 critic_loss=168.1975 value_loss=16.6622
TRAIN_METRICS step=997000 actor_loss=0.0065 critic_loss=178.6120 value_loss=14.3989
TRAIN_METRICS step=998000 actor_loss=0.0051 critic_loss=913.4302 value_loss=36.4418
TRAIN_METRICS step=999000 actor_loss=0.0063 critic_loss=161.8335 value_loss=21.5565
TRAIN_METRICS step=1000000 actor_loss=0.0080 critic_loss=117.1778 value_loss=11.2903
D4RL evaluation (pen-human-v1):
Evaluation over 10 episodes: 3822.210 , D4RL score: 125.008
Evaluation over 10 episodes: 3903.808 , D4RL score: 127.746
Evaluation over 10 episodes: 3925.793 , D4RL score: 128.483
Final D4RL score: 128.483
### hammer-human-v1 (scripts/hammer.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 actor_loss=0.0769 critic_loss=424.3253 value_loss=48.7755
TRAIN_METRICS step=997000 actor_loss=0.0356 critic
Tool Input
1{
2 "end_line": 368,
3 "filename": "CORL/algorithms/offline/custom_adroit.py",
4 "start_line": 366,
5 "op": "replace",
6 "content": " # adv = (adv - adv.mean()) / (adv.std() + 1e-8) # REMOVED: destroys global value scale on cloned datasets!",
7 "after_line": 365
8}OK: Replaced lines 366..368 with 1 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–387.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–387 | total: 496 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
Tool Input
1{
2 "filename": "CORL/algorithms/offline/custom_adroit.py",
3 "after_line": 365,
4 "end_line": 365,
5 "start_line": 365,
6 "content": " adv = q_act - v_act",
7 "op": "replace"
8}OK: Replaced lines 365..365 with 1 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–387.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–387 | total: 496 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
Tool Input
1{
2 "content": " v_act = self.vf(states)",
3 "filename": "CORL/algorithms/offline/custom_adroit.py",
4 "end_line": 365,
5 "start_line": 365,
6 "op": "insert",
7 "after_line": 365
8}OK: Inserted 1 line(s) after line 365 in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–388.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–388 | total: 497 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
Tool Input
1{
2 "end_line": 366,
3 "op": "replace",
4 "filename": "CORL/algorithms/offline/custom_adroit.py",
5 "after_line": 365,
6 "content": " v_act = self.vf(states)\n adv = q_act - v_act",
7 "start_line": 365
8}OK: Replaced lines 365..366 with 2 line(s) in CORL/algorithms/offline/custom_adroit.py. Editable range: 214–388.
[Current file: CORL/algorithms/offline/custom_adroit.py | editable: 214–388 | total: 497 lines]
211| # tau, discount.
212| # Example: CONFIG_OVERRIDES = {"normalize": False, "actor_lr": 1e-3}
213| # =====================================================================
214 CONFIG_OVERRIDES: Dict[str, Any] = {
215 "normalize": True,
216 "actor_lr": 3e-4,
217 "critic_lr": 3e-4,
218 }
219
220
221 class DeterministicActor(nn.Module):
222 def __init__(self, state_dim: int, action_dim: int, max_action: float, orthogonal_init: bool = True):
223 super().__init__()
224 self.max_action = max_action
225 self.net = nn.Sequential(
226 nn.Linear(state_dim, 256), nn.ReLU(),
227 nn.Linear(256, 256), nn.ReLU(),
228 nn.Linear(256, 256), nn.ReLU(),
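The net effect of the last three edits is that the per-batch advantage normalization is gone while the v_act / adv lines that the first replace accidentally dropped are restored. A toy sketch of why the raw advantage scale matters for the exp(beta * adv) weights on a narrow, near-expert dataset (the numbers are made up for illustration, not taken from the run):

import torch

beta = 5.0
# On near-expert human demos, most advantages are small and nearly tied.
adv = torch.tensor([0.02, 0.01, -0.01, 0.03])

raw_w = torch.exp(beta * adv).clamp(max=100.0)
# tensor([1.1052, 1.0513, 0.9512, 1.1618]) -> weights stay close to uniform, i.e. near plain BC

norm = (adv - adv.mean()) / (adv.std() + 1e-8)
norm_w = torch.exp(beta * norm).clamp(max=100.0)
# Batch normalization rescales nearly-tied advantages to unit variance, so the
# exponential weights spread over orders of magnitude and amplify estimation noise.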
[Test #3] (0 tests remaining)
## Seed 42
### pen-human-v1 (scripts/pen.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 actor_loss=0.0059 critic_loss=168.1975 value_loss=16.6622
TRAIN_METRICS step=997000 actor_loss=0.0113 critic_loss=178.6120 value_loss=14.3989
TRAIN_METRICS step=998000 actor_loss=0.0079 critic_loss=913.4302 value_loss=36.4418
TRAIN_METRICS step=999000 actor_loss=0.0018 critic_loss=161.8335 value_loss=21.5565
TRAIN_METRICS step=1000000 actor_loss=0.0097 critic_loss=117.1778 value_loss=11.2903
D4RL evaluation (pen-human-v1):
Evaluation over 10 episodes: 3048.116 , D4RL score: 99.037
Evaluation over 10 episodes: 3121.335 , D4RL score: 101.493
Evaluation over 10 episodes: 3274.281 , D4RL score: 106.625
Final D4RL score: 106.625
### hammer-human-v1 (scripts/hammer.sh)
Training metrics (last steps):
TRAIN_METRICS step=996000 actor_loss=0.1469 critic_loss=424.3253 value_loss=48.7755
TRAIN_METRICS step=997000 actor_loss=0.1669 critic_loss=265.1998 value_loss=4