safe-rl
Description
Safe RL: Constraint-Handling Mechanism Design
Objective
Design a constraint-handling mechanism for safe reinforcement learning. Your code goes in custom_lag.py, a subclass of PPO registered as CustomLag. Reference implementations (PPOLag using Lagrange multiplier, CPPOPID using PID controller) are provided as read-only.
Background
Safe RL aims to maximize reward while satisfying safety constraints (keeping episode cost below a limit). The key challenge is how to adaptively balance reward and cost: the Lagrangian approach converts the constrained problem to an unconstrained dual problem via a multiplier lambda, while PID methods use control theory for more responsive constraint satisfaction. You must design: (1) a multiplier update rule in _update(), and (2) an advantage combination formula in _compute_adv_surrogate().
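The two design points can be sketched in plain Python. This is an illustrative example, not the reference implementation: the function names, learning rate, and the normalization in the combination formula are assumptions (PPOLag, for instance, updates lambda with an Adam optimizer rather than a plain gradient step).

```python
def update_multiplier(lambda_: float, ep_cost: float, cost_limit: float,
                      lr: float = 0.01) -> float:
    """One dual-ascent step on the Lagrange multiplier: increase lambda
    when the mean episode cost exceeds the limit, decrease it otherwise,
    projecting back to lambda >= 0."""
    return max(0.0, lambda_ + lr * (ep_cost - cost_limit))


def combine_advantages(adv_r: float, adv_c: float, lambda_: float) -> float:
    """Lagrangian advantage combination: subtract the cost advantage
    scaled by lambda, then renormalize so the surrogate keeps roughly
    the reward scale as lambda grows (a common but optional choice)."""
    return (adv_r - lambda_ * adv_c) / (1.0 + lambda_)
```

A custom mechanism would replace these with its own rules inside `_update()` and `_compute_adv_surrogate()` respectively.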
Evaluation
Evaluated on 3 Safety-Gymnasium environments to test generalization:
- SafetyPointGoal1-v0: point robot navigating to goals while avoiding hazards
- SafetyCarGoal1-v0: car robot (non-holonomic) navigating to goals while avoiding hazards
- SafetyPointButton1-v0: point robot pressing goal buttons while avoiding hazards
Metrics: episode reward (higher is better) and episode cost (lower is better, target <= 25.0). Each environment trains for 2M steps.
Baselines
- naive: no constraint handling (pure PPO, ignores cost)
- ppo_lag: Lagrangian multiplier updated via Adam optimizer
- pid_lag: PID controller for multiplier update
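The pid_lag baseline can be sketched as a small PID controller on the multiplier. Gains, state handling, and the non-negativity projections below are illustrative assumptions, not the omnisafe CPPOPID implementation (see pid_lagrange.py for the real one).

```python
class PIDLagrangianSketch:
    """Illustrative PID controller for the safety multiplier.

    Proportional term reacts to the current constraint violation,
    integral term accumulates persistent violations, derivative term
    damps fast cost increases for a more responsive response than
    pure dual ascent."""

    def __init__(self, cost_limit: float, kp: float = 0.1,
                 ki: float = 0.01, kd: float = 0.01) -> None:
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_cost = 0.0

    def step(self, ep_cost: float) -> float:
        error = ep_cost - self.cost_limit
        self.integral = max(0.0, self.integral + error)
        derivative = max(0.0, ep_cost - self.prev_cost)
        self.prev_cost = ep_cost
        # Project the multiplier to stay non-negative.
        return max(0.0, self.kp * error + self.ki * self.integral
                   + self.kd * derivative)
```

Because the integral term keeps pushing lambda up while cost sits above the limit, PID controllers tend to enforce the constraint more tightly than a single-gain multiplier, which matches the low episode costs of pid_lag in the results below.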
Code
1"""Custom Lagrangian-based safe PPO for MLS-Bench.23EDITABLE section: imports + constraint handling methods.4FIXED sections: algorithm registration, learn() with metrics reporting.5"""67from __future__ import annotations89import time1011import numpy as np12import torch1314from omnisafe.algorithms import registry15from omnisafe.algorithms.on_policy.base.ppo import PPO
Additional context files (read-only):
- omnisafe/omnisafe/common/lagrange.py
- omnisafe/omnisafe/common/pid_lagrange.py
- omnisafe/omnisafe/algorithms/on_policy/base/ppo.py
Results
| Model | Type | ep ret SafetyPointGoal1-v0 ↑ | ep cost SafetyPointGoal1-v0 ↓ | ep ret SafetyCarGoal1-v0 ↑ | ep cost SafetyCarGoal1-v0 ↓ | ep ret SafetyPointButton1-v0 ↑ | ep cost SafetyPointButton1-v0 ↓ |
|---|---|---|---|---|---|---|---|
| naive | baseline | 25.536 | 51.417 | 32.824 | 60.717 | 19.691 | 152.430 |
| pid_lag | baseline | 0.200 | 19.270 | 1.996 | 19.303 | 0.464 | 24.970 |
| ppo_lag | baseline | 15.141 | 45.580 | 18.583 | 46.713 | 4.010 | 56.223 |
| claude-opus-4.6 | vanilla | 0.582 | 12.997 | 1.751 | 16.350 | -0.237 | 25.463 |
| deepseek-reasoner | vanilla | -0.453 | 66.290 | -0.546 | 44.393 | 0.252 | 75.453 |
| gemini-3.1-pro-preview | vanilla | 12.921 | 37.747 | 7.420 | 41.413 | 3.571 | 45.103 |
| gpt-5.4 | vanilla | - | - | 0.152 | 10.960 | - | - |
| qwen3.6-plus | vanilla | 24.402 | 49.537 | 30.164 | 54.363 | 19.700 | 136.417 |
| claude-opus-4.6 | agent | 0.582 | 12.997 | 1.751 | 16.350 | -0.237 | 25.463 |
| deepseek-reasoner | agent | -0.453 | 66.290 | -0.546 | 44.393 | 0.252 | 75.453 |
| gemini-3.1-pro-preview | agent | 12.921 | 37.747 | 7.420 | 41.413 | 3.571 | 45.103 |
| gpt-5.4 | agent | - | - | 0.152 | 10.960 | - | - |
| qwen3.6-plus | agent | 24.402 | 49.537 | 30.164 | 54.363 | 19.700 | 136.417 |