rl-onpolicy-continuous
Description
Online RL: On-Policy Actor-Critic for Continuous Control
Objective
Design and implement an on-policy actor-critic RL algorithm for continuous control. Your code goes in custom_onpolicy_continuous.py. Three reference implementations (PPO, RPO, PPO-Penalty) are provided as read-only.
Background
On-policy methods collect trajectories using the current policy, compute advantages via Generalized Advantage Estimation (GAE), and update the policy using mini-batch optimization. Key challenges include sample efficiency, stability of policy updates, and balancing exploration with exploitation. Different approaches address these through clipped surrogate objectives, stochasticity injection, or direct policy gradient estimation.
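The GAE computation mentioned above can be sketched as a backward recursion over one rollout. This is a minimal illustration, not the provided reference code; the function name and the `gamma`/`lam` defaults are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single rollout.

    rewards, dones: arrays of length T; values: length T + 1
    (final entry is the bootstrap value for the state after the rollout).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future TD residuals
        last_gae = delta + gamma * lam * nonterminal * last_gae
        advantages[t] = last_gae
    # Value targets for the critic regression
    returns = advantages + values[:-1]
    return advantages, returns
```

Setting `lam=1` recovers Monte Carlo advantages; `lam=0` recovers one-step TD residuals.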
Constraints
- Network architecture dimensions are FIXED and cannot be modified
- Total parameter count is enforced at runtime
- Focus on algorithmic innovation: new loss functions, update rules, exploration strategies, etc.
- Do NOT simply copy a reference implementation with minor changes
Evaluation
Policies are trained and evaluated on HalfCheetah-v4, Hopper-v4, and Walker2d-v4. Additional held-out environments (not shown during intermediate testing) are used to assess generalization. Metric: mean episodic return over 10 evaluation episodes (higher is better).
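The metric can be reproduced with a short rollout loop. This sketch assumes a Gymnasium-style environment (5-tuple `step` return) and a `policy` callable mapping observation to action; both names are illustrative.

```python
def evaluate(env, policy, num_episodes=10, seed=0):
    """Mean episodic return over `num_episodes` (the benchmark metric).

    `env` follows the Gymnasium API: reset(seed=...) -> (obs, info),
    step(action) -> (obs, reward, terminated, truncated, info).
    """
    returns = []
    for ep in range(num_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, episode_return = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            episode_return += reward
            done = terminated or truncated
        returns.append(episode_return)
    return sum(returns) / len(returns)
```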
Code
```python
# Custom on-policy continuous RL algorithm for MLS-Bench
#
# FIXED sections: config, env, utilities, network architecture, training loop.
# EDITABLE section: get_action_and_value method and compute_losses function.
import copy
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
```
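For orientation, the clipped surrogate objective used by the PPO reference can be sketched as below. The `compute_losses` signature and coefficients here are illustrative, not the exact editable hook from the file.

```python
import torch
import torch.nn.functional as F

def compute_losses(new_logprob, old_logprob, advantages, values, returns,
                   entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.0):
    # Importance ratio between current policy and the data-collecting policy
    ratio = (new_logprob - old_logprob).exp()
    # Normalize advantages per mini-batch (common PPO practice)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # Clipped surrogate: take the pessimistic (larger-loss) branch
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()
    # Critic regression toward GAE returns
    v_loss = 0.5 * F.mse_loss(values, returns)
    # Entropy bonus encourages exploration
    entropy_loss = entropy.mean()
    return pg_loss + vf_coef * v_loss - ent_coef * entropy_loss
```

The clipping step is what distinguishes the three reference baselines: RPO perturbs the action-distribution mean instead, and PPO-Penalty replaces the clip with a KL penalty term.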
Results
| Model | Type | Eval return (HalfCheetah-v4) ↑ | Eval return (Swimmer-v4) ↑ | Eval return (InvertedDoublePendulum-v4) ↑ |
|---|---|---|---|---|
| awr | baseline | 1996.730 | 90.180 | 7299.200 |
| ppo | baseline | 1757.643 | 113.203 | 7048.387 |
| ppo_penalty | baseline | 1676.613 | 101.393 | 6877.090 |
| anthropic/claude-opus-4.6 | vanilla | 1231.407 | 107.947 | 207.923 |
| deepseek-reasoner | vanilla | -336.170 | -2.163 | 49.087 |
| google/gemini-3.1-pro-preview | vanilla | 2414.017 | 109.250 | 4505.760 |
| gpt-5.4-pro | vanilla | 2456.060 | 116.387 | 8342.993 |
| qwen3.6-plus | vanilla | 1327.230 | 111.650 | 7610.330 |
| anthropic/claude-opus-4.6 | agent | 1256.027 | 106.490 | 4895.003 |
| deepseek-reasoner | agent | 1413.610 | 46.427 | - |
| google/gemini-3.1-pro-preview | agent | 1132.653 | 113.760 | 7148.160 |
| gpt-5.4-pro | agent | 2456.060 | 116.387 | 8342.993 |
| qwen3.6-plus | agent | 1479.653 | 113.027 | 7344.990 |