safe-rl

Tags: Reinforcement Learning, omnisafe, rigorous codebase

Description

Safe RL: Constraint-Handling Mechanism Design

Objective

Design a constraint-handling mechanism for safe reinforcement learning. Your code goes in custom_lag.py, which defines CustomLag, a subclass of PPO registered with the algorithm registry. Two reference implementations are provided read-only: PPOLag (Lagrange multiplier) and CPPOPID (PID controller).

Background

Safe RL aims to maximize reward while satisfying safety constraints (keeping episode cost below a limit). The key challenge is how to adaptively balance reward and cost: the Lagrangian approach converts the constrained problem to an unconstrained dual problem via a multiplier lambda, while PID methods use control theory for more responsive constraint satisfaction. You must design: (1) a multiplier update rule in _update(), and (2) an advantage combination formula in _compute_adv_surrogate().
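As a concrete sketch of the two pieces you must design (not the reference implementation; the learning rate and the 1/(1+lambda) normalization are illustrative assumptions in the spirit of PPOLag):

```python
import numpy as np


def update_multiplier(lmbda: float, ep_cost: float, cost_limit: float,
                      lr: float = 0.035) -> float:
    """Dual gradient ascent: raise lambda when the average episode cost
    exceeds the limit, decay it otherwise, clipped at zero."""
    return max(0.0, lmbda + lr * (ep_cost - cost_limit))


def combine_advantages(adv_r: np.ndarray, adv_c: np.ndarray,
                       lmbda: float) -> np.ndarray:
    """Lagrangian surrogate: subtract lambda times the cost advantage,
    renormalized so the combined advantage keeps a bounded scale."""
    return (adv_r - lmbda * adv_c) / (1.0 + lmbda)
```

When episode cost sits above the limit, lambda grows and the surrogate advantage tilts the policy update toward cost reduction; when the constraint is satisfied, lambda decays back toward zero and reward dominates again.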

Evaluation

Evaluated on 3 Safety-Gymnasium environments to test generalization:

  • SafetyPointGoal1-v0: point robot navigating to goals while avoiding hazards
  • SafetyCarGoal1-v0: car robot (non-holonomic) navigating to goals while avoiding hazards
  • SafetyPointButton1-v0: point robot pressing goal buttons while avoiding hazards

Metrics: episode reward (higher is better) and episode cost (lower is better, target <= 25.0). Each environment trains for 2M steps.

Baselines

  • naive: no constraint handling (pure PPO, ignores cost)
  • ppo_lag: Lagrangian multiplier updated via Adam optimizer
  • pid_lag: PID controller for multiplier update

Code

custom_lag.py
```python
"""Custom Lagrangian-based safe PPO for MLS-Bench.

EDITABLE section: imports + constraint handling methods.
FIXED sections: algorithm registration, learn() with metrics reporting.
"""

from __future__ import annotations

import time

import numpy as np
import torch

from omnisafe.algorithms import registry
from omnisafe.algorithms.on_policy.base.ppo import PPO
```

Additional context files (read-only):

  • omnisafe/omnisafe/common/lagrange.py
  • omnisafe/omnisafe/common/pid_lagrange.py
  • omnisafe/omnisafe/algorithms/on_policy/base/ppo.py

Results

| Model | Type | ep ret SafetyPointGoal1-v0 | ep cost SafetyPointGoal1-v0 | ep ret SafetyCarGoal1-v0 | ep cost SafetyCarGoal1-v0 | ep ret SafetyPointButton1-v0 | ep cost SafetyPointButton1-v0 |
|---|---|---|---|---|---|---|---|
| naive | baseline | 25.536 | 51.417 | 32.824 | 60.717 | 19.691 | 152.430 |
| pid_lag | baseline | 0.200 | 19.270 | 1.996 | 19.303 | 0.464 | 24.970 |
| ppo_lag | baseline | 15.141 | 45.580 | 18.583 | 46.713 | 4.010 | 56.223 |
| claude-opus-4.6 | vanilla | 0.582 | 12.997 | 1.751 | 16.350 | -0.237 | 25.463 |
| deepseek-reasoner | vanilla | -0.453 | 66.290 | -0.546 | 44.393 | 0.252 | 75.453 |
| gemini-3.1-pro-preview | vanilla | 12.921 | 37.747 | 7.420 | 41.413 | 3.571 | 45.103 |
| gpt-5.4 | vanilla | - | - | 0.152 | 10.960 | - | - |
| qwen3.6-plus | vanilla | 24.402 | 49.537 | 30.164 | 54.363 | 19.700 | 136.417 |
| claude-opus-4.6 | agent | 0.582 | 12.997 | 1.751 | 16.350 | -0.237 | 25.463 |
| deepseek-reasoner | agent | -0.453 | 66.290 | -0.546 | 44.393 | 0.252 | 75.453 |
| gemini-3.1-pro-preview | agent | 12.921 | 37.747 | 7.420 | 41.413 | 3.571 | 45.103 |
| gpt-5.4 | agent | - | - | 0.152 | 10.960 | - | - |
| qwen3.6-plus | agent | 24.402 | 49.537 | 30.164 | 54.363 | 19.700 | 136.417 |

Agent Conversations