rl-offpolicy-continuous

Tags: Reinforcement Learning, cleanrl, rigorous codebase

Description

Online RL: Off-Policy Actor-Critic for Continuous Control

Objective

Design and implement an off-policy actor-critic RL algorithm for continuous control. Your code goes in custom_offpolicy_continuous.py. Three reference implementations (DDPG, TD3, SAC) are provided as read-only.

Background

Off-policy methods maintain a replay buffer of past experience and update the policy using data collected under previous policies. Key challenges include overestimation bias in Q-value estimates, the exploration-exploitation tradeoff, and sample efficiency. Different approaches address these through twin critics, entropy regularization, target smoothing, or delayed updates; a sketch of two of these mechanisms follows.
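
For concreteness, here is a minimal sketch combining TD3-style target policy smoothing with clipped double Q-learning (twin critics). All names and hyperparameters below are illustrative assumptions, not part of the provided reference code.

import torch

@torch.no_grad()
def clipped_double_q_target(rewards, dones, next_obs, actor_target,
                            q1_target, q2_target,
                            gamma=0.99, noise_std=0.2, noise_clip=0.5):
    # Target policy smoothing: perturb the target action with clipped noise
    # so the critic cannot exploit narrow peaks in the Q-function.
    next_actions = actor_target(next_obs)
    noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
    next_actions = (next_actions + noise).clamp(-1.0, 1.0)

    # Clipped double Q-learning: take the elementwise minimum of the twin
    # target critics to counteract overestimation bias.
    q_next = torch.min(q1_target(next_obs, next_actions),
                       q2_target(next_obs, next_actions))
    return rewards + gamma * (1.0 - dones) * q_next

Both critics are then regressed toward this shared target; delayed (less frequent) actor and target-network updates are the usual companion tricks.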

Constraints

  • Network architecture dimensions are FIXED and cannot be modified
  • Total parameter count is enforced at runtime (see the budget-check sketch after this list)
  • Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
  • Do NOT simply copy a reference implementation with minor changes
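
A runtime budget check might look like the following sketch; total_params and PARAM_BUDGET are hypothetical names, since the actual enforcement lives in the fixed sections of the file.

import torch.nn as nn

def total_params(*modules: nn.Module) -> int:
    # Count trainable parameters across all submitted networks.
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Hypothetical usage inside the harness:
# assert total_params(actor, qf1, qf2) <= PARAM_BUDGET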

Evaluation

Submissions are trained and evaluated on HalfCheetah-v4, Hopper-v4, and Walker2d-v4. Additional held-out environments (not shown during intermediate testing) assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better), as sketched below.
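
The stated metric can be computed with a standard gymnasium evaluation loop. The sketch below assumes a deterministic actor emitting actions in [-1, 1]; it is an illustration, not the benchmark's actual harness.

import gymnasium as gym
import numpy as np
import torch

@torch.no_grad()
def evaluate(actor, env_id="HalfCheetah-v4", episodes=10, seed=0):
    # Mean episodic return over a fixed number of evaluation episodes.
    env = gym.make(env_id)
    returns = []
    for ep in range(episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            action = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    env.close()
    return float(np.mean(returns))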

Code

custom_offpolicy_continuous.py
# Custom off-policy continuous RL algorithm for MLS-Bench
#
# EDITABLE section: Actor, QNetwork, and OffPolicyAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, eval, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
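
The listing above is truncated at the imports. A hedged skeleton of the editable classes might look like the following; the hidden sizes and method signatures are assumptions, since the fixed architecture dimensions are not shown here.

class Actor(nn.Module):
    # EDITABLE: deterministic policy mapping observations to actions in [-1, 1].
    # hidden=256 is an assumption; the real dimensions are fixed by the harness.
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

class QNetwork(nn.Module):
    # EDITABLE: state-action value function Q(s, a).
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

class OffPolicyAlgorithm:
    # EDITABLE: losses, update rules, and exploration strategy go here.
    ...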

Results

Eval return by environment (mean episodic return over 10 episodes; higher is better):

Model                           Type      HalfCheetah-v4  Reacher-v4    Ant-v4
ddpg                            baseline       11514.817      -3.943  1243.040
sac                             baseline       10360.447      -4.757  3577.097
td3                             baseline       11054.790      -3.830  4642.473
anthropic/claude-opus-4.6       vanilla        11347.473      -3.763  5302.537
deepseek-reasoner               vanilla         -532.790     -10.950  -453.183
google/gemini-3.1-pro-preview   vanilla           -1.583     -33.120   991.843
gpt-5.4-pro                     vanilla        10252.330      -3.713  2573.457
openai/gpt-5.4-pro              vanilla         6680.537      -3.927  1959.657
qwen3.6-plus                    vanilla         -103.907     -21.447   401.837
anthropic/claude-opus-4.6       agent          11347.473      -3.763  5302.537
deepseek-reasoner               agent            768.423      -7.003  2903.493
google/gemini-3.1-pro-preview   agent          11557.677      -3.760  5096.260
gpt-5.4-pro                     agent           5173.637      -4.337  1612.913
openai/gpt-5.4-pro              agent          10727.720      -3.793  4773.530
qwen3.6-plus                    agent           -103.907     -39.273   401.837
