rl-offline-off2on
Description
Offline-to-Online RL: Preventing Catastrophic Forgetting in Fine-Tuning
Objective
Design and implement an offline-to-online RL algorithm that pretrains from an offline dataset (1M steps), then fine-tunes with online interaction (1M steps) without catastrophic forgetting or Q-value collapse. Your code goes in custom_finetune.py. Three reference implementations (AWAC, SPOT, Cal-QL) are provided as read-only.
Background
The critical challenge is the offline-to-online transition: naive fine-tuning often causes Q-value collapse (the conservative value estimates learned offline become over-optimistic once online data arrives) and catastrophic forgetting of the pretrained behavior. The Adroit cloned-v1 datasets mix expert and noisy demonstrations, making this transition particularly challenging.
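One standard way to keep the policy anchored to the data during this transition is advantage-weighted regression, the idea behind the provided AWAC baseline. A minimal NumPy sketch of the weighting term (function name, `lam`, and the clip value are illustrative choices, not the benchmark's API):

```python
import numpy as np

def awac_actor_weights(q_values, v_values, lam=1.0, clip=100.0):
    """Advantage-weighted weights in the style of AWAC: actions whose Q-value
    exceeds the state value get exponentially larger weight in the actor loss,
    so the policy stays close to the dataset during early fine-tuning."""
    advantage = q_values - v_values
    weights = np.exp(advantage / lam)
    # Clipping keeps a few high-advantage samples from dominating the update.
    return np.minimum(weights, clip)

adv_weights = awac_actor_weights(np.array([1.0, 2.0]), np.array([1.5, 1.5]))
```

The actor loss then maximizes `weights * log_prob(action)` over dataset actions, which is what makes the transition less prone to forgetting than plain policy-gradient fine-tuning.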
Constraints
- Network dimensions are fixed at 256: all MLP hidden layers must use 256 units. A `_mlp()` factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256.
- Total parameter count is enforced: the training loop checks that total trainable parameters do not exceed 1.2x the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity.
- Do NOT simply copy a reference implementation with minor changes
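To see what the parameter budget allows, here is a small sketch of the arithmetic the training loop's check implies (the helper names and the `[obs_dim, 256, 256, act_dim]` layout are assumptions for illustration; the real `_mlp()` and check live in the FIXED section):

```python
def mlp_param_count(sizes):
    """Total weights + biases of a fully-connected MLP with the given layer
    sizes, e.g. [obs_dim, 256, 256, act_dim]."""
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

def within_budget(candidate_params, baseline_params, slack=1.2):
    # Mirrors the stated rule: trainable parameters must not exceed
    # 1.2x the largest baseline architecture.
    return candidate_params <= slack * baseline_params

# Example: a two-hidden-layer 256-wide net on a 24-dim observation,
# 4-dim action space.
n = mlp_param_count([24, 256, 256, 4])
```

Because the slack is only 1.2x, adding even one extra 256x256 hidden layer (~65k parameters) can blow the budget, which is why the task steers you toward loss-level rather than capacity-level changes.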
Evaluation
Trained and evaluated on Pen, Door, Hammer using Adroit cloned-v1 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert), evaluated throughout both phases.
Code
```python
# Custom offline-to-online RL algorithm for MLS-Bench — Adroit fine-tuning
#
# EDITABLE section: network definitions + OfflineOnlineAlgorithm class.
# FIXED sections: everything else (config, utilities, data, eval, training loop).
import os
import random
import uuid
from copy import deepcopy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union

import d4rl
import gym
import numpy as np
import pyrallis
```
Results
| Model | Type | D4RL score: pen-cloned-v1 ↑ | D4RL score: hammer-cloned-v1 ↑ | D4RL score: hammer-expert-v1 ↑ |
|---|---|---|---|---|
| awac | baseline | 73.377 | 0.207 | 126.890 |
| iql | baseline | 98.194 | 2.637 | 118.209 |
| spot | baseline | 77.466 | 2.444 | 74.058 |
| anthropic/claude-opus-4.6 | vanilla | 80.928 | 1.298 | 125.061 |
| deepseek-reasoner | vanilla | 32.736 | 0.240 | 102.163 |
| google/gemini-3.1-pro-preview | vanilla | 40.102 | 4.427 | 51.188 |
| openai/gpt-5.4-pro | vanilla | 74.896 | 1.563 | 88.278 |
| qwen/qwen3.6-plus | vanilla | 22.497 | 0.262 | 52.642 |
| anthropic/claude-opus-4.6 | agent | 89.276 | 4.637 | 98.176 |
| deepseek-reasoner | agent | 38.255 | 0.165 | 120.705 |
| google/gemini-3.1-pro-preview | agent | 71.798 | 5.147 | 129.376 |
| openai/gpt-5.4-pro | agent | - | - | 82.010 |
| qwen/qwen3.6-plus | agent | 22.497 | 0.262 | 52.642 |