Constraint Handling for Safe RL

Modify the Lagrangian or controller-style multiplier update and the cost-reward advantage mixing to improve reward while keeping episode cost below the target.

Reinforcement Learning · omnisafe · safe-rl

Description

Safe RL: Constraint-Handling Mechanism Design

Research Question

Design a constraint-handling mechanism for safe reinforcement learning. Your code goes in custom_lag.py, which defines a subclass of PPO registered as CustomLag. Reference implementations using a Lagrange multiplier (PPOLag) and a PID controller (CPPOPID) are provided as read-only *.edit.py baselines.

Background

Safe RL aims to maximize reward while keeping a long-run cost (e.g. the count of safety violations) below a fixed limit. The standard approach formulates the problem as a constrained MDP and converts it to an unconstrained dual problem via a multiplier lambda updated from the running cost violation. The mechanism that updates this multiplier and combines reward and cost advantages directly determines the agent's safety behavior:

  • naive — constraint-unaware PPO baseline that ignores the safety constraint entirely; provides an upper bound on reward with no cost control.
  • PPOLag — the multiplier is treated as a learnable parameter optimized by Adam to satisfy the dual objective. Simple but slow to react and prone to oscillation.
  • CPPOPID — Stooke, Achiam and Abbeel, "Responsive Safety in Reinforcement Learning by PID Lagrangian Methods" (arXiv:2007.03964, ICML 2020). Replaces the integral-only Lagrange update with a PID controller; the benchmark uses the paper-style CPPOPID configuration with gains kp = 0.1, ki = 0.01, kd = 0.01 and a derivative delay window of 10 epochs (matching omnisafe/common/pid_lagrange.py).
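
The two multiplier updates described above can be sketched in plain Python. This is a simplified illustration, not the omnisafe implementation: `lagrangian_step` uses projected gradient ascent with a hypothetical learning rate `lr` in place of PPOLag's Adam optimizer, and `PIDMultiplier` is a hypothetical class that omits any smoothing the library may apply. Only the gains, the 10-epoch delay, and the cost limit of 25.0 come from this page.

```python
from collections import deque


def lagrangian_step(lam: float, ep_cost: float,
                    cost_limit: float = 25.0, lr: float = 0.05) -> float:
    """Integral-only update: projected gradient ascent on the dual variable."""
    # lr is an illustrative value, not a benchmark setting.
    return max(0.0, lam + lr * (ep_cost - cost_limit))


class PIDMultiplier:
    """Simplified PID controller on the cost violation (hypothetical helper)."""

    def __init__(self, cost_limit: float = 25.0, kp: float = 0.1,
                 ki: float = 0.01, kd: float = 0.01, delay: int = 10) -> None:
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        # Costs from the last `delay` epochs, oldest first.
        self.history = deque(maxlen=delay)

    def update(self, ep_cost: float) -> float:
        delta = ep_cost - self.cost_limit
        # Integral term, kept non-negative so it cannot wind up below zero.
        self.integral = max(0.0, self.integral + self.ki * delta)
        # Derivative term: only a cost that rose relative to the oldest
        # stored epoch contributes.
        oldest = self.history[0] if self.history else ep_cost
        derivative = max(0.0, ep_cost - oldest)
        self.history.append(ep_cost)
        return max(0.0, self.kp * delta + self.integral + self.kd * derivative)
```

With an episode cost of 35.0 on the first epoch, the controller returns 0.1 * 10 + 0.01 * 10 = 1.1, while the integral-only step with lr = 0.05 returns 0.5. The proportional term is what lets the PID variant react within a single epoch.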

You must design:

  1. A multiplier update rule in _update().
  2. An advantage combination formula in _compute_adv_surrogate() that blends the reward advantage adv_r and cost advantage adv_c using the current multiplier (e.g. (adv_r - lam * adv_c) / (1 + lam) in the standard Lagrangian baseline).

The PPO rollout loop, value functions, optimizer, environment interface, and registration plumbing are fixed.
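
The standard Lagrangian combination quoted in item 2 can be written as a small function. This is a sketch only: `combine_advantages` is a hypothetical name, and the real _compute_adv_surrogate() signature is not shown on this page. Because it uses only arithmetic operators, the same expression should apply elementwise to NumPy arrays or torch tensors.

```python
def combine_advantages(adv_r, adv_c, lam: float):
    """Blend reward and cost advantages with the current multiplier.

    Dividing by (1 + lam) keeps the scale of the result bounded as lam grows.
    """
    return (adv_r - lam * adv_c) / (1.0 + lam)
```

At lam = 0 this reduces to the reward advantage alone; as lam grows the result approaches -adv_c, so the policy update is dominated by cost reduction.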

Evaluation

Evaluated on Safety-Gymnasium navigation environments including:

  • SafetyPointGoal1-v0 — point robot navigating to goals while avoiding hazards.
  • SafetyCarGoal1-v0 — non-holonomic car robot with the same goal structure.
  • SafetyPointButton1-v0 — point robot pressing goal buttons while avoiding hazards.

Training in each environment runs for the benchmark's fixed step budget. Metrics:

  • Episode return (reward) — higher is better.
  • Episode cost (cost) — lower is better, with a target threshold of 25.0 per the Safety-Gymnasium convention used in omnisafe.

High return counts in a method's favor only if it also keeps episode cost under control in every environment.

Code

custom_lag.py
"""Custom Lagrangian-based safe PPO for MLS-Bench.

EDITABLE section: imports + constraint handling methods.
FIXED sections: algorithm registration, learn() with metrics reporting.
"""

from __future__ import annotations

import time

import numpy as np
import torch

from omnisafe.algorithms import registry
from omnisafe.algorithms.on_policy.base.ppo import PPO

lagrange.py
# Copyright 2023 OmniSafe Team. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of Lagrange."""

pid_lagrange.py
# Copyright 2023 OmniSafe Team. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of PID Lagrange."""

ppo.py
# Copyright 2023 OmniSafe Team. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of the PPO algorithm."""

Method Summary

Auto-summarized from each method's code by an LLM reviewer; this is not the model's original output.
Claude Opus 4.6 · Pseudocode · high

Adaptive PID-Lag with asymmetric advantage

PID Lagrangian targeting 80% of the cost limit, with violation-quadratic gain scheduling, anti-windup integral, predictive lookahead, and 1.5x cost penalty on cost-increasing actions.

Per epoch:
1. $\delta = J_c - 0.8\,c_{\lim}$; $v = \max(0, \delta/c_{\lim})$
2. gain scale $\sigma = 1 + 3v + 2v^2$; $k_p = 0.15\sigma$, $k_i = 0.02\sigma$, $k_d = 0.015$
3. EMA smoothed $\bar\delta \leftarrow 0.85\bar\delta + 0.15\delta$
4. integral $I$: if $\delta > 0$ add $k_i\delta$ else add $0.33\,k_i\delta$; clip to $[0, 10]$
5. EMA cost $\bar J \leftarrow 0.85\bar J + 0.15 J_c$; $d = \max(0, \bar J - \bar J_{-10})$
6. predictive trend $p = 0.03\max(0, \bar J - \bar J_{-5})$
7. $\lambda = \max(0, k_p\bar\delta + I + k_d d + p)$
8. $A = (A_r - \lambda(1 + 0.5\,\mathbb{1}[A_c > 0])\,A_c)/(1 + \lambda)$
Δ vs. baseline: Extends the pid_lag baseline with violation-quadratic gain scheduling targeting 80% of the cost limit, asymmetric integral (fast up, slow down), a predictive trend term, and an asymmetric cost-advantage combination.
kp_base=0.15 · ki_base=0.02 · kd_base=0.015 · safety_margin=0.8 * cost_limit · integral_max=10.0 · windup_asymmetry=1.0 / 0.33 · ema_alpha=0.85 / 0.15 · predict_weight=0.03 · asymmetry=0.5
Recovers PID-Lag baseline at violation_ratio=0 and asymmetry=0
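
The eight per-epoch steps can be turned into a runnable sketch along the following lines. This is a reconstruction from the summary, not the agent's submitted code. The names `AdaptivePIDMultiplier` and `asymmetric_advantage` are hypothetical. The summary does not specify initial values or how the lagged EMAs behave in early epochs, so the sketch assumes all state starts at zero and that a missing lagged value falls back to the current EMA. The default cost limit of 25.0 is the threshold stated in the Evaluation section.

```python
from collections import deque


class AdaptivePIDMultiplier:
    """Hypothetical re-implementation of steps 1-7 of the summary."""

    def __init__(self, cost_limit: float = 25.0) -> None:
        self.cost_limit = cost_limit
        # Assumption: EMAs and the integral start at zero.
        self.delta_ema = 0.0
        self.cost_ema = 0.0
        self.integral = 0.0
        self.ema_history = deque(maxlen=10)  # past cost EMAs, oldest first

    def update(self, ep_cost: float) -> float:
        c_lim = self.cost_limit
        # Step 1: violation relative to 80% of the limit.
        delta = ep_cost - 0.8 * c_lim
        v = max(0.0, delta / c_lim)
        # Step 2: quadratic gain schedule.
        sigma = 1.0 + 3.0 * v + 2.0 * v * v
        kp, ki, kd = 0.15 * sigma, 0.02 * sigma, 0.015
        # Step 3: smoothed violation for the proportional term.
        self.delta_ema = 0.85 * self.delta_ema + 0.15 * delta
        # Step 4: integrate quickly above target, slowly below, then clip.
        step = ki * delta if delta > 0 else 0.33 * ki * delta
        self.integral = min(10.0, max(0.0, self.integral + step))
        # Step 5: smoothed cost and its rise over the last 10 epochs.
        self.cost_ema = 0.85 * self.cost_ema + 0.15 * ep_cost
        hist = self.ema_history
        # Assumption: with too little history, use the current EMA (zero trend).
        ema_10 = hist[0] if len(hist) == 10 else self.cost_ema
        ema_5 = hist[-5] if len(hist) >= 5 else self.cost_ema
        d = max(0.0, self.cost_ema - ema_10)
        # Step 6: short-horizon predictive trend.
        p = 0.03 * max(0.0, self.cost_ema - ema_5)
        hist.append(self.cost_ema)
        # Step 7: non-negative multiplier.
        return max(0.0, kp * self.delta_ema + self.integral + kd * d + p)


def asymmetric_advantage(adv_r, adv_c, lam: float, asymmetry: float = 0.5):
    """Step 8: weight cost-increasing samples (adv_c > 0) by 1 + asymmetry."""
    weight = 1.0 + asymmetry * (adv_c > 0)
    return (adv_r - lam * weight * adv_c) / (1.0 + lam)
```

Setting asymmetry = 0 gives back the standard (adv_r - lam * adv_c) / (1 + lam) combination.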

Results