rl-offline-adroit

Reinforcement Learning | CORL | rigorous codebase

Description

Offline RL: Dexterous Manipulation with Narrow Expert Data (Adroit)

Objective

Design and implement an offline RL algorithm for high-dimensional dexterous manipulation from narrow human demonstration data (~25 demos). Your code goes in custom_adroit.py. Three reference implementations (BC-10%, AWAC, ReBRAC) are provided as read-only.
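For orientation, the simplest of the named references, BC-10%, is usually described as behavior cloning on only the highest-return fraction of trajectories. The sketch below shows that filtering step with NumPy. It is a minimal illustration, not the provided reference code: the function name `top_fraction_mask` and the assumption that `terminals` marks the last transition of each episode are hypothetical.

```python
import numpy as np


def top_fraction_mask(rewards, terminals, fraction=0.1):
    """Boolean mask over transitions that keeps the top `fraction` of episodes by return.

    Assumed layout (hypothetical): flat per-transition arrays, with
    terminals[i] truthy at the last transition of each episode.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    terminals = np.asarray(terminals, dtype=bool)
    if rewards.size == 0:
        raise ValueError("dataset is empty")

    # Indices where episodes end; a trailing unterminated chunk counts as an episode.
    ends = np.flatnonzero(terminals)
    if len(ends) == 0 or ends[-1] != len(rewards) - 1:
        ends = np.append(ends, len(rewards) - 1)
    starts = np.concatenate(([0], ends[:-1] + 1))

    # Undiscounted return of each episode.
    returns = np.array([rewards[s:e + 1].sum() for s, e in zip(starts, ends)])

    # Keep at least one episode, rounding the count up.
    n_keep = max(1, int(np.ceil(fraction * len(returns))))
    keep = np.argsort(returns)[-n_keep:]

    mask = np.zeros(len(rewards), dtype=bool)
    for k in keep:
        mask[starts[k]:ends[k] + 1] = True
    return mask
```

With roughly 25 demonstrations, a 10% cut keeps only about three trajectories, which shows how little data such a filter leaves on these datasets.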

Background

Adroit tasks control a 24-DoF robotic hand with high-dimensional action spaces (24-30 dimensions). The human-v1 datasets are narrow, containing only ~25 human teleoperation demonstrations, which creates far more severe distribution shift than in locomotion tasks.

Constraints

Network dimensions are fixed at 256: all MLP hidden layers must use 256 units. A _mlp() factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256.
Total parameter count is enforced: the training loop checks that the total number of trainable parameters does not exceed 1.2x that of the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity.
Do not simply copy a reference implementation with minor changes.
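The parameter budget can be estimated by hand before training. The sketch below counts the weights and biases of a fully connected network with 256-unit hidden layers and compares a total against 1.2x the largest baseline. It is an assumption-laden illustration: the helper names, the default of two hidden layers, and the example dimensions are hypothetical, and the actual check in the training loop may count parameters differently.

```python
def mlp_param_count(in_dim, out_dim, hidden=256, n_hidden=2):
    """Weights plus biases of a plain MLP: in_dim -> hidden x n_hidden -> out_dim."""
    sizes = [in_dim] + [hidden] * n_hidden + [out_dim]
    # Each linear layer contributes a*b weights and b biases.
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))


def within_budget(custom_total, baseline_totals, ratio=1.2):
    """True if the custom total stays within `ratio` times the largest baseline total."""
    return custom_total <= ratio * max(baseline_totals)


# Illustrative dimensions only (hypothetical, not read from the task files).
actor = mlp_param_count(in_dim=45, out_dim=24)
critic = mlp_param_count(in_dim=45 + 24, out_dim=1)
print(actor, critic, within_budget(actor + 2 * critic, [actor + 2 * critic]))
```

Because the hidden width is fixed, the main ways to exceed the budget are extra networks (additional critics, value functions, or ensembles) and extra layers, so it helps to tally every trainable module rather than only the actor.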

Evaluation

Trained and evaluated on Pen (rotation), Door (opening), Hammer (nailing) using Adroit human-v1 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert).
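The normalized score rescales a raw episode return linearly so that the random-policy reference maps to 0 and the expert reference maps to 100. A minimal sketch of that formula follows. The function name and the reference values in the example are placeholders, not the actual D4RL reference returns for any Adroit task.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Linear rescaling: random reference -> 0, expert reference -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Placeholder reference returns, for illustration only.
RANDOM_REF, EXPERT_REF = 0.0, 10.0
print(normalized_score(5.0, RANDOM_REF, EXPERT_REF))
print(normalized_score(-1.0, RANDOM_REF, EXPERT_REF))
```

The scale is not clipped, so a policy that does worse than the random reference receives a negative score and one that beats the expert reference scores above 100.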

Code

custom_adroit.py


# Custom offline RL algorithm for MLS-Bench - Adroit dexterous manipulation
#
# EDITABLE section: network definitions + OfflineAlgorithm class.
# FIXED sections: everything else (config, utilities, data, eval, training loop).
import os
import random
import uuid
from copy import deepcopy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union

import d4rl
import gym
import numpy as np
import pyrallis

Results


Model	Type	d4rl score pen human v1 ↑	d4rl score hammer human v1 ↑	d4rl score door cloned v1 ↑
awac	baseline	72.695	3.075	0.511
iql	baseline	100.058	3.437	-0.080
rebrac	baseline	67.990	0.784	0.076
anthropic/claude-opus-4.6	vanilla	82.473	2.115	2.119
deepseek-reasoner	vanilla	54.159	4.417	-
google/gemini-3.1-pro-preview	vanilla	-	-	-
gpt-5.4-pro	vanilla	46.222	2.636	-0.047
qwen3.6-plus	vanilla	-	-	-0.105
anthropic/claude-opus-4.6	agent	72.728	2.253	-
deepseek-reasoner	agent	64.055	1.488	-0.063
google/gemini-3.1-pro-preview	agent	89.419	2.858	3.558
gpt-5.4-pro	agent	91.888	1.568	3.006
qwen3.6-plus	agent	-	-	-
qwen3.6-plus	agent	-	-	-
qwen3.6-plus	agent	-	-	-0.136

Agent Conversations

anthropic/claude-opus-4.6: 5 steps
deepseek-reasoner: 7 steps
google/gemini-3.1-pro-preview: 9 steps
gpt-5.4-pro: 5 steps