rl-offpolicy-continuous
Description
Online RL: Off-Policy Actor-Critic for Continuous Control
Objective
Design and implement an off-policy actor-critic RL algorithm for continuous control. Your code goes in `custom_offpolicy_continuous.py`. Three reference implementations (DDPG, TD3, SAC) are provided as read-only.
Background
Off-policy methods maintain a replay buffer of past experience and update the policy using data collected under previous policies. Key challenges include overestimation bias in Q-value estimates, the exploration-exploitation tradeoff, and sample efficiency. Different approaches address these through twin critics, entropy regularization, target smoothing, or delayed updates.
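As a concrete illustration of two of these mechanisms, here is a minimal NumPy sketch of a TD3-style target: twin critics are combined with a minimum to counteract overestimation, and the target action is perturbed with clipped noise (target smoothing). Function names and default hyperparameters are illustrative, not part of the benchmark harness.

```python
import numpy as np

def smoothed_next_action(mu_next, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Target policy smoothing: perturb the target policy's action with
    clipped Gaussian noise, then clip back to the valid action range."""
    noise = np.clip(np.random.normal(0.0, noise_std, mu_next.shape),
                    -noise_clip, noise_clip)
    return np.clip(mu_next + noise, -act_limit, act_limit)

def clipped_double_q_target(rewards, dones, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: take the element-wise minimum of the twin
    critics' next-state estimates to reduce overestimation bias."""
    min_q = np.minimum(q1_next, q2_next)
    return rewards + gamma * (1.0 - dones) * min_q
```

SAC replaces the smoothing noise with a stochastic policy and subtracts an entropy term inside the target; the clipped-minimum over twin critics is shared by both.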
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
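Since the parameter budget is enforced at runtime, it can help to sanity-check the count before submitting. The helper below is a hypothetical sketch for a fully connected MLP with biases; the layer sizes in the comment are illustrative (HalfCheetah-like 17-dim observations, 6-dim actions), not the benchmark's fixed dimensions.

```python
def mlp_param_count(sizes):
    """Parameter count of a fully connected MLP with bias terms:
    each layer contributes fan_in * fan_out weights plus fan_out biases.
    Example: sizes=[17, 256, 256, 6] for a 17-dim obs, 6-dim action MLP."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))
```

With PyTorch modules, the same check is `sum(p.numel() for p in model.parameters())`.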
Evaluation
Trained and evaluated on HalfCheetah-v4, Hopper-v4, Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
Code
```python
# Custom off-policy continuous RL algorithm for MLS-Bench
#
# EDITABLE section: Actor, QNetwork, and OffPolicyAlgorithm classes.
# FIXED sections: everything else (config, env, buffer, eval, training loop).
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
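For orientation, here is a hypothetical sketch of what the editable Actor class might look like: a deterministic MLP policy with a tanh-squashed output scaled to the action bounds. The hidden width and constructor signature are placeholders, not the benchmark's fixed architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of a deterministic actor (assumed interface): maps observations
    to actions in [-act_limit, act_limit] via a tanh-squashed MLP."""
    def __init__(self, obs_dim, act_dim, act_limit, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs):
        # Scale the squashed output to the environment's action range.
        return self.act_limit * self.net(obs)
```

A stochastic (SAC-style) actor would instead output a mean and log-std and squash a sampled action, but the bounded-output constraint is the same.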
Results
| Model | Type | Eval return HalfCheetah-v4 ↑ | Eval return Reacher-v4 ↑ | Eval return Ant-v4 ↑ |
|---|---|---|---|---|
| ddpg | baseline | 11514.817 | -3.943 | 1243.040 |
| sac | baseline | 10360.447 | -4.757 | 3577.097 |
| td3 | baseline | 11054.790 | -3.830 | 4642.473 |
| anthropic/claude-opus-4.6 | vanilla | 11347.473 | -3.763 | 5302.537 |
| deepseek-reasoner | vanilla | -532.790 | -10.950 | -453.183 |
| google/gemini-3.1-pro-preview | vanilla | -1.583 | -33.120 | 991.843 |
| gpt-5.4-pro | vanilla | 10252.330 | -3.713 | 2573.457 |
| openai/gpt-5.4-pro | vanilla | 6680.537 | -3.927 | 1959.657 |
| qwen3.6-plus | vanilla | -103.907 | -21.447 | 401.837 |
| anthropic/claude-opus-4.6 | agent | 11347.473 | -3.763 | 5302.537 |
| deepseek-reasoner | agent | 768.423 | -7.003 | 2903.493 |
| google/gemini-3.1-pro-preview | agent | 11557.677 | -3.760 | 5096.260 |
| gpt-5.4-pro | agent | 5173.637 | -4.337 | 1612.913 |
| openai/gpt-5.4-pro | agent | 10727.720 | -3.793 | 4773.530 |
| qwen3.6-plus | agent | -103.907 | -39.273 | 401.837 |