rl-offline-continuous
Tags: Reinforcement Learning · CORL · rigorous codebase
Description
Offline RL: Q-Value Overestimation Suppression in Continuous Control
Objective
Design and implement an offline RL algorithm that suppresses Q-value overestimation while learning from static datasets. Your code goes in custom.py. Four reference implementations (BC, TD3+BC, IQL, CQL) are provided as read-only.
Background
In offline RL, standard Q-learning tends to overestimate Q-values for out-of-distribution actions: the bootstrapped maximization selects actions where the critic's errors are largest, and because the agent cannot collect new data, those errors are never corrected. The result is poor policy performance at deployment.
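One widely used suppression mechanism is a CQL-style conservative penalty added to the critic loss, which pushes Q-values down on sampled (potentially out-of-distribution) actions and up on dataset actions. A minimal NumPy sketch; the array names `q_random` and `q_data` are illustrative, not taken from the provided code:

```python
import numpy as np

def conservative_penalty(q_random: np.ndarray, q_data: np.ndarray) -> float:
    """CQL-style regularizer to add to the critic loss.

    q_random: (batch, n_sampled) Q-values for actions sampled per state.
    q_data:   (batch,) Q-values for the dataset actions.
    """
    # logsumexp over sampled actions is a soft maximum of Q(s, a);
    # subtracting the dataset Q pushes OOD values down, in-distribution values up.
    logsumexp = np.log(np.exp(q_random).sum(axis=1))
    return float(np.mean(logsumexp - q_data))
```

Minimizing this term alongside the usual TD loss trades off Bellman accuracy against conservatism, typically via a scalar weight.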
Constraints
- Network dimensions are fixed at 256. All MLP hidden layers must use 256 units. A _mlp() factory function is provided in the FIXED section for convenience. You may define custom network classes, but hidden widths must remain 256.
- Total parameter count is enforced. The training loop checks that total trainable parameters do not exceed 1.2x the largest baseline architecture. Focus on algorithmic innovations (loss functions, regularization, training procedures), not network capacity.
- Do NOT simply copy a reference implementation with minor changes.
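The parameter budget can be sanity-checked before training. A back-of-the-envelope sketch, assuming fully connected layers with biases (the exact enforcement logic in the training loop may differ):

```python
def mlp_param_count(dims):
    # Weights plus biases for each linear layer in an MLP with the given layer sizes.
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

# Example: a Q-critic for HalfCheetah (state dim 17, action dim 6)
# with two hidden layers of 256 units and a scalar output.
critic_params = mlp_param_count([17 + 6, 256, 256, 1])  # 72193 parameters
```

Summing this over every network you instantiate (actors, twin critics, value functions) and comparing against 1.2x the largest baseline gives a quick pre-flight check.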
Evaluation
Trained and evaluated on HalfCheetah, Hopper, Walker2d using D4RL MuJoCo medium-v2 datasets. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: D4RL normalized score (0 = random, 100 = expert).
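The D4RL normalized score linearly rescales raw episode return so that the random policy maps to 0 and the expert policy to 100. A sketch of the computation; the reference returns below are placeholders for illustration, not the official per-environment values:

```python
def d4rl_normalized_score(episode_return, random_return, expert_return):
    # Linear rescaling: random policy -> 0, expert policy -> 100.
    return 100.0 * (episode_return - random_return) / (expert_return - random_return)

# Placeholder reference returns for illustration only.
score = d4rl_normalized_score(episode_return=2500.0,
                              random_return=0.0,
                              expert_return=5000.0)  # -> 50.0
```

In practice the official reference returns ship with D4RL, so scores are comparable across papers.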
Code
custom.py
```python
# Custom offline RL algorithm for MLS-Bench
#
# EDITABLE section: network definitions + OfflineAlgorithm class.
# FIXED sections: everything else (config, utilities, data, eval, training loop).
import os
import random
import uuid
from copy import deepcopy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union

import d4rl
import gym
import numpy as np
import pyrallis
```
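Several of the reference baselines (TD3+BC, IQL, CQL) rely on the clipped double-Q trick: the Bellman target takes the minimum over twin critics, which itself damps overestimation. A minimal NumPy sketch of that target computation; array names are illustrative, not from the provided code:

```python
import numpy as np

def td_target(rewards, dones, q1_next, q2_next, gamma=0.99):
    # Taking the minimum over the twin critics' next-state values
    # biases the target low, counteracting Q-value overestimation.
    return rewards + gamma * (1.0 - dones) * np.minimum(q1_next, q2_next)
```

A custom algorithm would typically combine this target with an additional suppression mechanism (a behavior-cloning term, expectile regression, or a conservative penalty) rather than rely on the clipped target alone.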
Results
| Model | Type | HalfCheetah-medium-v2 (D4RL score ↑) | Maze2d-medium-v1 (D4RL score ↑) | Walker2d-medium-v2 (D4RL score ↑) |
|---|---|---|---|---|
| iql | baseline | 48.102 | 33.731 | 80.462 |
| rebrac | baseline | 63.347 | 93.949 | 87.536 |
| td3_bc | baseline | 48.328 | 50.293 | 85.141 |
| anthropic/claude-opus-4.6 | vanilla | 61.890 | 74.120 | 86.525 |
| deepseek-reasoner | vanilla | - | - | - |
| google/gemini-3.1-pro-preview | vanilla | 56.352 | 90.997 | 83.606 |
| gpt-5.4-pro | vanilla | 47.464 | 30.804 | 81.220 |
| qwen3.6-plus | vanilla | - | - | - |
| anthropic/claude-opus-4.6 | agent | 16.034 | 45.159 | 56.793 |
| deepseek-reasoner | agent | 51.523 | 35.080 | 81.133 |
| google/gemini-3.1-pro-preview | agent | 61.123 | 99.386 | 49.880 |
| gpt-5.4-pro | agent | 47.464 | 30.804 | 81.220 |
| qwen3.6-plus | agent | 48.678 | 36.972 | 85.250 |