rl-offline-off2on
Description
Offline-to-Online RL: Preventing Catastrophic Forgetting in Fine-Tuning
Objective
Design and implement an offline-to-online RL algorithm that pretrains from an offline dataset (1M steps), then fine-tunes with online interaction (1M steps) without catastrophic forgetting or Q-value collapse. Your code goes in custom_finetune.py. Three reference implementations (AWAC, SPOT, Cal-QL) are provided as read-only.
Background
The critical challenge is the offline-to-online transition: naive fine-tuning often causes Q-value collapse (the conservative value estimates learned offline become over-optimistic once online data arrives) and catastrophic forgetting of the pretrained behavior. The Adroit cloned-v1 datasets mix expert and noisy demonstrations, making this transition particularly challenging.
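One standard way to keep the policy anchored to the data during this transition is advantage-weighted regression, the idea behind the provided AWAC baseline. A minimal NumPy sketch of the weighting term (function name, `lam`, and the clip value are illustrative choices, not the benchmark's API):

```python
import numpy as np

def awac_actor_weights(q_values, v_values, lam=1.0, clip=100.0):
    """Advantage-weighted weights in the style of AWAC: actions whose Q-value
    exceeds the state value get exponentially larger weight in the actor loss,
    so the policy stays close to the dataset during early fine-tuning."""
    advantage = q_values - v_values
    weights = np.exp(advantage / lam)
    # Clipping keeps a few high-advantage samples from dominating the update.
    return np.minimum(weights, clip)

adv_weights = awac_actor_weights(np.array([1.0, 2.0]), np.array([1.5, 1.5]))
```

The actor loss then maximizes `weights * log_prob(action)` over dataset actions, which is what makes the transition less prone to forgetting than plain policy-gradient fine-tuning.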
Constraints
- Network dimensions are fixed at 256: all MLP hidden layers must use 256 units. A `_mlp()` factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256.
- Total parameter count is enforced: the training loop checks that total trainable parameters do not exceed 1.2x the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity.
- Do NOT simply copy a reference implementation with minor changes
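To see what the parameter budget allows, here is a small sketch of the arithmetic the training loop's check implies (the helper names and the `[obs_dim, 256, 256, act_dim]` layout are assumptions for illustration; the real `_mlp()` and check live in the FIXED section):

```python
def mlp_param_count(sizes):
    """Total weights + biases of a fully-connected MLP with the given layer
    sizes, e.g. [obs_dim, 256, 256, act_dim]."""
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

def within_budget(candidate_params, baseline_params, slack=1.2):
    # Mirrors the stated rule: trainable parameters must not exceed
    # 1.2x the largest baseline architecture.
    return candidate_params <= slack * baseline_params

# Example: a two-hidden-layer 256-wide net on a 24-dim observation,
# 4-dim action space.
n = mlp_param_count([24, 256, 256, 4])
```

Because the slack is only 1.2x, adding even one extra 256x256 hidden layer (~65k parameters) can blow the budget, which is why the task steers you toward loss-level rather than capacity-level changes.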
Evaluation
Trained and evaluated on Pen, Door, Hammer using Adroit cloned-v1 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert), evaluated throughout both phases.
Code
```python
# Custom offline-to-online RL algorithm for MLS-Bench — Adroit fine-tuning
#
# EDITABLE section: network definitions + OfflineOnlineAlgorithm class.
# FIXED sections: everything else (config, utilities, data, eval, training loop).
import os
import random
import uuid
from copy import deepcopy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union

import d4rl
import gym
import numpy as np
import pyrallis
```
Results
| Model | Type | D4RL score: pen-cloned-v1 ↑ | D4RL score: hammer-cloned-v1 ↑ | D4RL score: hammer-expert-v1 ↑ |
|---|---|---|---|---|
| awac | baseline | 73.377 | 0.207 | 126.890 |
| iql | baseline | 98.194 | 2.637 | 118.209 |
| spot | baseline | 77.466 | 2.444 | 74.058 |
| anthropic/claude-opus-4.6 | vanilla | 80.928 | 1.298 | 125.061 |
| deepseek-reasoner | vanilla | 32.736 | 0.240 | 102.163 |
| google/gemini-3.1-pro-preview | vanilla | 40.102 | 4.427 | 51.188 |
| openai/gpt-5.4-pro | vanilla | 74.896 | 1.563 | 88.278 |
| qwen/qwen3.6-plus | vanilla | 22.497 | 0.262 | 52.642 |
| anthropic/claude-opus-4.6 | agent | 89.276 | 4.637 | 98.176 |
| deepseek-reasoner | agent | 38.255 | 0.165 | 120.705 |
| google/gemini-3.1-pro-preview | agent | 71.798 | 5.147 | 129.376 |
| openai/gpt-5.4-pro | agent | - | - | 82.010 |
| qwen/qwen3.6-plus | agent | 22.497 | 0.262 | 52.642 |