rl-onpolicy-continuous
Description
Online RL: On-Policy Actor-Critic for Continuous Control
Objective
Design and implement an on-policy actor-critic RL algorithm for continuous control. Your code goes in custom_onpolicy_continuous.py. Three reference implementations (PPO, RPO, PPO-Penalty) are provided as read-only.
Background
On-policy methods collect trajectories using the current policy, compute advantages via Generalized Advantage Estimation (GAE), and update the policy using mini-batch optimization. Key challenges include sample efficiency, stability of policy updates, and balancing exploration with exploitation. Different approaches address these through clipped surrogate objectives, stochasticity injection, or direct policy gradient estimation.
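The GAE computation mentioned above can be sketched as a backward recursion over one rollout. This is a minimal illustration, not the provided reference code; the function name and the `gamma`/`lam` defaults are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single rollout.

    rewards, dones: arrays of length T; values: length T + 1
    (final entry is the bootstrap value for the state after the rollout).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future TD residuals
        last_gae = delta + gamma * lam * nonterminal * last_gae
        advantages[t] = last_gae
    # Value targets for the critic regression
    returns = advantages + values[:-1]
    return advantages, returns
```

Setting `lam=1` recovers Monte Carlo advantages; `lam=0` recovers one-step TD residuals.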
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
Evaluation
Policies are trained and evaluated on HalfCheetah-v4, Hopper-v4, and Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
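The metric can be reproduced with a short rollout loop. This sketch assumes a Gymnasium-style environment (5-tuple `step` return) and a `policy` callable mapping observation to action; both names are illustrative.

```python
def evaluate(env, policy, num_episodes=10, seed=0):
    """Mean episodic return over `num_episodes` (the benchmark metric).

    `env` follows the Gymnasium API: reset(seed=...) -> (obs, info),
    step(action) -> (obs, reward, terminated, truncated, info).
    """
    returns = []
    for ep in range(num_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, episode_return = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            episode_return += reward
            done = terminated or truncated
        returns.append(episode_return)
    return sum(returns) / len(returns)
```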
Code
```python
# Custom on-policy continuous RL algorithm for MLS-Bench
#
# FIXED sections: config, env, utilities, network architecture, training loop.
# EDITABLE section: get_action_and_value method and compute_losses function.
import copy
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
```
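For orientation, the clipped surrogate objective used by the PPO reference can be sketched as below. The `compute_losses` signature and coefficients here are illustrative, not the exact editable hook from the file.

```python
import torch
import torch.nn.functional as F

def compute_losses(new_logprob, old_logprob, advantages, values, returns,
                   entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.0):
    # Importance ratio between current policy and the data-collecting policy
    ratio = (new_logprob - old_logprob).exp()
    # Normalize advantages per mini-batch (common PPO practice)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # Clipped surrogate: take the pessimistic (larger-loss) branch
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()
    # Critic regression toward GAE returns
    v_loss = 0.5 * F.mse_loss(values, returns)
    # Entropy bonus encourages exploration
    entropy_loss = entropy.mean()
    return pg_loss + vf_coef * v_loss - ent_coef * entropy_loss
```

The clipping step is what distinguishes the three reference baselines: RPO perturbs the action-distribution mean instead, and PPO-Penalty replaces the clip with a KL penalty term.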
Results
| Model | Type | Eval return (HalfCheetah-v4) ↑ | Eval return (Swimmer-v4) ↑ | Eval return (InvertedDoublePendulum-v4) ↑ |
|---|---|---|---|---|
| awr | baseline | 1996.730 | 90.180 | 7299.200 |
| ppo | baseline | 1757.643 | 113.203 | 7048.387 |
| ppo_penalty | baseline | 1676.613 | 101.393 | 6877.090 |
| anthropic/claude-opus-4.6 | vanilla | 1231.407 | 107.947 | 207.923 |
| deepseek-reasoner | vanilla | -336.170 | -2.163 | 49.087 |
| google/gemini-3.1-pro-preview | vanilla | 2414.017 | 109.250 | 4505.760 |
| gpt-5.4-pro | vanilla | 2456.060 | 116.387 | 8342.993 |
| qwen3.6-plus | vanilla | 1327.230 | 111.650 | 7610.330 |
| anthropic/claude-opus-4.6 | agent | 1256.027 | 106.490 | 4895.003 |
| deepseek-reasoner | agent | 1413.610 | 46.427 | - |
| google/gemini-3.1-pro-preview | agent | 1132.653 | 113.760 | 7148.160 |
| gpt-5.4-pro | agent | 2456.060 | 116.387 | 8342.993 |
| qwen3.6-plus | agent | 1479.653 | 113.027 | 7344.990 |