rl-offpolicy-continuous
Description
Online RL: Off-Policy Actor-Critic for Continuous Control
Objective
Design and implement an off-policy actor-critic RL algorithm for continuous control. Your code goes in `custom_offpolicy_continuous.py`. Three reference implementations (DDPG, TD3, SAC) are provided as read-only.
Background
Off-policy methods maintain a replay buffer of past experience and update the policy using data collected under previous policies. Key challenges include overestimation bias in Q-value estimates, the exploration-exploitation tradeoff, and sample efficiency. Different approaches address these through twin critics, entropy regularization, target smoothing, or delayed updates.
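As a concrete illustration of two of these mechanisms, here is a minimal NumPy sketch of a TD3-style target: twin critics are combined with a minimum to counteract overestimation, and the target action is perturbed with clipped noise (target smoothing). Function names and default hyperparameters are illustrative, not part of the benchmark harness.

```python
import numpy as np

def smoothed_next_action(mu_next, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Target policy smoothing: perturb the target policy's action with
    clipped Gaussian noise, then clip back to the valid action range."""
    noise = np.clip(np.random.normal(0.0, noise_std, mu_next.shape),
                    -noise_clip, noise_clip)
    return np.clip(mu_next + noise, -act_limit, act_limit)

def clipped_double_q_target(rewards, dones, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: take the element-wise minimum of the twin
    critics' next-state estimates to reduce overestimation bias."""
    min_q = np.minimum(q1_next, q2_next)
    return rewards + gamma * (1.0 - dones) * min_q
```

SAC replaces the smoothing noise with a stochastic policy and subtracts an entropy term inside the target; the clipped-minimum over twin critics is shared by both.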
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
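Since the parameter budget is enforced at runtime, it can help to sanity-check the count before submitting. The helper below is a hypothetical sketch for a fully connected MLP with biases; the layer sizes in the comment are illustrative (HalfCheetah-like 17-dim observations, 6-dim actions), not the benchmark's fixed dimensions.

```python
def mlp_param_count(sizes):
    """Parameter count of a fully connected MLP with bias terms:
    each layer contributes fan_in * fan_out weights plus fan_out biases.
    Example: sizes=[17, 256, 256, 6] for a 17-dim obs, 6-dim action MLP."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))
```

With PyTorch modules, the same check is `sum(p.numel() for p in model.parameters())`.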
Evaluation
Trained and evaluated on HalfCheetah-v4, Hopper-v4, Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
Code
```python
# Custom off-policy continuous RL algorithm for MLS-Bench
#
# EDITABLE section: Actor, QNetwork, and OffPolicyAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, eval, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
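For orientation, here is a hypothetical sketch of what the editable Actor class might look like: a deterministic MLP policy with a tanh-squashed output scaled to the action bounds. The hidden width and constructor signature are placeholders, not the benchmark's fixed architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of a deterministic actor (assumed interface): maps observations
    to actions in [-act_limit, act_limit] via a tanh-squashed MLP."""
    def __init__(self, obs_dim, act_dim, act_limit, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs):
        # Scale the squashed output to the environment's action range.
        return self.act_limit * self.net(obs)
```

A stochastic (SAC-style) actor would instead output a mean and log-std and squash a sampled action, but the bounded-output constraint is the same.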
Results
| Model | Type | Eval return HalfCheetah-v4 ↑ | Eval return Reacher-v4 ↑ | Eval return Ant-v4 ↑ |
|---|---|---|---|---|
| ddpg | baseline | 11514.817 | -3.943 | 1243.040 |
| sac | baseline | 10360.447 | -4.757 | 3577.097 |
| td3 | baseline | 11054.790 | -3.830 | 4642.473 |
| anthropic/claude-opus-4.6 | vanilla | 11347.473 | -3.763 | 5302.537 |
| deepseek-reasoner | vanilla | -532.790 | -10.950 | -453.183 |
| google/gemini-3.1-pro-preview | vanilla | -1.583 | -33.120 | 991.843 |
| gpt-5.4-pro | vanilla | 10252.330 | -3.713 | 2573.457 |
| openai/gpt-5.4-pro | vanilla | 6680.537 | -3.927 | 1959.657 |
| qwen3.6-plus | vanilla | -103.907 | -21.447 | 401.837 |
| anthropic/claude-opus-4.6 | agent | 11347.473 | -3.763 | 5302.537 |
| deepseek-reasoner | agent | 768.423 | -7.003 | 2903.493 |
| google/gemini-3.1-pro-preview | agent | 11557.677 | -3.760 | 5096.260 |
| gpt-5.4-pro | agent | 5173.637 | -4.337 | 1612.913 |
| openai/gpt-5.4-pro | agent | 10727.720 | -3.793 | 4773.530 |
| qwen3.6-plus | agent | -103.907 | -39.273 | 401.837 |