rl-offline-continuous
Tags: Reinforcement Learning · CORL · rigorous codebase
Description
Offline RL: Q-Value Overestimation Suppression in Continuous Control
Objective
Design and implement an offline RL algorithm that suppresses Q-value overestimation while learning from static datasets. Your code goes in custom.py. Four reference implementations (BC, TD3+BC, IQL, CQL) are provided as read-only.
Background
In offline RL, standard Q-learning tends to overestimate Q-values for out-of-distribution actions: the bootstrapped maximization selects actions where the critic's errors are largest, and because the agent cannot collect new data, those errors are never corrected. The result is poor policy performance at deployment.
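One widely used suppression mechanism is a CQL-style conservative penalty added to the critic loss, which pushes Q-values down on sampled (potentially out-of-distribution) actions and up on dataset actions. A minimal NumPy sketch; the array names `q_random` and `q_data` are illustrative, not taken from the provided code:

```python
import numpy as np

def conservative_penalty(q_random: np.ndarray, q_data: np.ndarray) -> float:
    """CQL-style regularizer to add to the critic loss.

    q_random: (batch, n_sampled) Q-values for actions sampled per state.
    q_data:   (batch,) Q-values for the dataset actions.
    """
    # logsumexp over sampled actions is a soft maximum of Q(s, a);
    # subtracting the dataset Q pushes OOD values down, in-distribution values up.
    logsumexp = np.log(np.exp(q_random).sum(axis=1))
    return float(np.mean(logsumexp - q_data))
```

Minimizing this term alongside the usual TD loss trades off Bellman accuracy against conservatism, typically via a scalar weight.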
Constraints
- Network dimensions are fixed at 256. All MLP hidden layers must use 256 units. A _mlp() factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256.
- Total parameter count is enforced. The training loop checks that total trainable parameters do not exceed 1.2x the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity.
- Do NOT simply copy a reference implementation with minor changes.
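The parameter budget can be sanity-checked before training. A back-of-the-envelope sketch, assuming fully connected layers with biases (the exact enforcement logic in the training loop may differ):

```python
def mlp_param_count(dims):
    # Weights plus biases for each linear layer in an MLP with the given layer sizes.
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

# Example: a Q-critic for HalfCheetah (state dim 17, action dim 6)
# with two hidden layers of 256 units and a scalar output.
critic_params = mlp_param_count([17 + 6, 256, 256, 1])  # 72193 parameters
```

Summing this over every network you instantiate (actors, twin critics, value functions) and comparing against 1.2x the largest baseline gives a quick pre-flight check.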
Evaluation
Trained and evaluated on HalfCheetah, Hopper, Walker2d using D4RL MuJoCo medium-v2 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert).
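The D4RL normalized score linearly rescales raw episode return so that the random policy maps to 0 and the expert policy to 100. A sketch of the computation; the reference returns below are placeholders for illustration, not the official per-environment values:

```python
def d4rl_normalized_score(episode_return, random_return, expert_return):
    # Linear rescaling: random policy -> 0, expert policy -> 100.
    return 100.0 * (episode_return - random_return) / (expert_return - random_return)

# Placeholder reference returns for illustration only.
score = d4rl_normalized_score(episode_return=2500.0,
                              random_return=0.0,
                              expert_return=5000.0)  # -> 50.0
```

In practice the official reference returns ship with D4RL, so scores are comparable across papers.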
Code
custom.py
```python
# Custom offline RL algorithm for MLS-Bench
#
# EDITABLE section: network definitions + OfflineAlgorithm class.
# FIXED sections: everything else (config, utilities, data, eval, training loop).
import os
import random
import uuid
from copy import deepcopy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union

import d4rl
import gym
import numpy as np
import pyrallis
```
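Several of the reference baselines (TD3+BC, IQL, CQL) rely on the clipped double-Q trick: the Bellman target takes the minimum over twin critics, which itself damps overestimation. A minimal NumPy sketch of that target computation; array names are illustrative, not from the provided code:

```python
import numpy as np

def td_target(rewards, dones, q1_next, q2_next, gamma=0.99):
    # Taking the minimum over the twin critics' next-state values
    # biases the target low, counteracting Q-value overestimation.
    return rewards + gamma * (1.0 - dones) * np.minimum(q1_next, q2_next)
```

A custom algorithm would typically combine this target with an additional suppression mechanism (a behavior-cloning term, expectile regression, or a conservative penalty) rather than rely on the clipped target alone.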
Results
| Model | Type | HalfCheetah-medium-v2 (D4RL score ↑) | Maze2d-medium-v1 (D4RL score ↑) | Walker2d-medium-v2 (D4RL score ↑) |
|---|---|---|---|---|
| iql | baseline | 48.102 | 33.731 | 80.462 |
| rebrac | baseline | 63.347 | 93.949 | 87.536 |
| td3_bc | baseline | 48.328 | 50.293 | 85.141 |
| anthropic/claude-opus-4.6 | vanilla | 61.890 | 74.120 | 86.525 |
| deepseek-reasoner | vanilla | - | - | - |
| google/gemini-3.1-pro-preview | vanilla | 56.352 | 90.997 | 83.606 |
| gpt-5.4-pro | vanilla | 47.464 | 30.804 | 81.220 |
| qwen3.6-plus | vanilla | - | - | - |
| anthropic/claude-opus-4.6 | agent | 16.034 | 45.159 | 56.793 |
| deepseek-reasoner | agent | 51.523 | 35.080 | 81.133 |
| google/gemini-3.1-pro-preview | agent | 61.123 | 99.386 | 49.880 |
| gpt-5.4-pro | agent | 47.464 | 30.804 | 81.220 |
| qwen3.6-plus | agent | 48.678 | 36.972 | 85.250 |