rl-onpolicy-continuous

Tags: Reinforcement Learning · cleanrl · rigorous codebase

Description

Online RL: On-Policy Actor-Critic for Continuous Control

Objective

Design and implement an on-policy actor-critic RL algorithm for continuous control. Your code goes in custom_onpolicy_continuous.py. Three reference implementations (PPO, RPO, PPO-Penalty) are provided as read-only.

Background

On-policy methods collect trajectories using the current policy, compute advantages via Generalized Advantage Estimation (GAE), and update the policy using mini-batch optimization. Key challenges include sample efficiency, stability of policy updates, and balancing exploration with exploitation. Different approaches address these through clipped surrogate objectives, stochasticity injection, or direct policy gradient estimation.
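The GAE step described above can be sketched as follows (a minimal illustration, not the benchmark's fixed utility code; the function name and array conventions are assumptions, with `dones[t]` marking that the episode ended at step `t`):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over a rollout of T steps.
    # Works backward through time, accumulating discounted TD errors.
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values)  # targets for the value function
    return advantages, returns
```

The resulting advantages weight the policy update, while `returns` serve as regression targets for the critic during mini-batch optimization.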

Constraints

  • Network architecture dimensions are FIXED and cannot be modified
  • Total parameter count is enforced at runtime
  • Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
  • Do NOT simply copy a reference implementation with minor changes

Evaluation

Submissions are trained and evaluated on HalfCheetah-v4, Hopper-v4, and Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
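The reported metric can be computed with a loop like the following (a sketch against the Gymnasium API; the `policy` callable and its determinism are assumptions):

```python
import numpy as np

def evaluate(env, policy, n_episodes=10):
    # Mean episodic return over n_episodes, the benchmark's reported metric.
    # `policy` maps an observation to an action.
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            ep_return += reward
            done = terminated or truncated  # count both natural and time-limit ends
        returns.append(ep_return)
    return float(np.mean(returns))
```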

Code

custom_onpolicy_continuous.py
# Custom on-policy continuous RL algorithm for MLS-Bench
#
# FIXED sections: config, env, utilities, network architecture, training loop.
# EDITABLE section: get_action_and_value method and compute_losses function.
import copy
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
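As a starting point for the editable compute_losses function, a clipped surrogate objective in the style of the PPO reference can be sketched as below. The function name and signature here are illustrative only, not the file's actual interface:

```python
import torch

def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_coef=0.2):
    # PPO-style clipped objective: bounds how far the probability ratio can
    # push the update away from the policy that collected the data.
    ratio = (new_logprobs - old_logprobs).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    # Negated because optimizers minimize; the surrogate is maximized.
    return -torch.min(unclipped, clipped).mean()
```

Per the constraints, this should be treated as a baseline to innovate on (e.g. a different trust-region penalty or exploration bonus), not copied with minor changes.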

Results

| Model                           | Type     | Eval return (HalfCheetah-v4) | Eval return (Swimmer-v4) | Eval return (InvertedDoublePendulum-v4) |
|---------------------------------|----------|------------------------------|--------------------------|-----------------------------------------|
| awr                             | baseline | 1996.730                     | 90.180                   | 7299.200                                |
| ppo                             | baseline | 1757.643                     | 113.203                  | 7048.387                                |
| ppo_penalty                     | baseline | 1676.613                     | 101.393                  | 6877.090                                |
| anthropic/claude-opus-4.6       | vanilla  | 1231.407                     | 107.947                  | 207.923                                 |
| deepseek-reasoner               | vanilla  | -336.170                     | -2.163                   | 49.087                                  |
| google/gemini-3.1-pro-preview   | vanilla  | 2414.017                     | 109.250                  | 4505.760                                |
| gpt-5.4-pro                     | vanilla  | 2456.060                     | 116.387                  | 8342.993                                |
| qwen3.6-plus                    | vanilla  | 1327.230                     | 111.650                  | 7610.330                                |
| anthropic/claude-opus-4.6       | agent    | 1256.027                     | 106.490                  | 4895.003                                |
| deepseek-reasoner               | agent    | 1413.610                     | 46.427                   | -                                       |
| google/gemini-3.1-pro-preview   | agent    | 1132.653                     | 113.760                  | 7148.160                                |
| gpt-5.4-pro                     | agent    | 2456.060                     | 116.387                  | 8342.993                                |
| qwen3.6-plus                    | agent    | 1479.653                     | 113.027                  | 7344.990                                |

Agent Conversations