Constraint Handling for Safe RL

Changes the Lagrangian or controller-style multiplier update and the cost-reward advantage mixing to improve reward while keeping episode cost below the target.

Reinforcement Learning · omnisafe

safe-rl

Description

Safe RL: Constraint-Handling Mechanism Design

Research Question

Design a constraint-handling mechanism for safe reinforcement learning. Your code goes in custom_lag.py, which defines a subclass of PPO registered as CustomLag. Reference implementations using a Lagrange multiplier (PPOLag) and a PID controller (CPPOPID) are provided as read-only *.edit.py baselines.

Background

Safe RL aims to maximize reward while keeping a long-run cost (e.g. the count of safety violations) below a fixed limit. The standard approach formulates the problem as a constrained MDP and converts it to an unconstrained dual problem via a multiplier lambda updated from the running cost violation. The mechanism that updates this multiplier and combines reward and cost advantages directly determines the agent's safety behavior:

naive — constraint-unaware PPO baseline that ignores the safety constraint entirely; provides an upper bound on reward with no cost control.
PPOLag — the multiplier is treated as a learnable parameter optimized by Adam to satisfy the dual objective. Simple but slow to react and prone to oscillation.
CPPOPID — Stooke, Achiam and Abbeel, "Responsive Safety in Reinforcement Learning by PID Lagrangian Methods" (arXiv:2007.03964, ICML 2020). Replaces the integral-only Lagrange update with a PID controller; the benchmark uses the paper-style CPPOPID configuration with gains kp = 0.1, ki = 0.01, kd = 0.01 and a derivative delay window of 10 epochs (matching omnisafe/common/pid_lagrange.py).
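As a rough illustration of the difference between an integral-only update and a PID update, the sketch below implements both on a scalar multiplier. It is not the omnisafe implementation: `dual_ascent_step`, `PIDMultiplier`, the step size `lr`, and the zero-initialised cost history are assumptions made for this example, and PPOLag itself uses Adam rather than the plain gradient step shown here. The default gains and the 10-epoch delay are the values quoted above.

```python
from collections import deque


def dual_ascent_step(lam: float, ep_cost: float, cost_limit: float, lr: float = 0.05) -> float:
    """One projected gradient-ascent step on the multiplier (integral-only behaviour).

    `lr` is a hypothetical step size chosen for illustration.
    """
    return max(0.0, lam + lr * (ep_cost - cost_limit))


class PIDMultiplier:
    """Hypothetical PID controller acting on the cost violation."""

    def __init__(self, cost_limit: float, kp: float = 0.1, ki: float = 0.01,
                 kd: float = 0.01, d_delay: int = 10) -> None:
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        # Costs from the last d_delay epochs (assumed to start at zero).
        self.history = deque([0.0] * d_delay, maxlen=d_delay)
        self.lam = 0.0

    def update(self, ep_cost: float) -> float:
        delta = ep_cost - self.cost_limit
        # Integral term, projected so it never goes negative.
        self.integral = max(0.0, self.integral + self.ki * delta)
        # Derivative term reacts only to rising cost, measured over d_delay epochs.
        derivative = max(0.0, ep_cost - self.history[0])
        self.lam = max(0.0, self.kp * delta + self.integral + self.kd * derivative)
        self.history.append(ep_cost)
        return self.lam


pid = PIDMultiplier(cost_limit=25.0)
for ep_cost in (40.0, 35.0, 28.0, 22.0):
    print(round(pid.update(ep_cost), 4))
```

The proportional and derivative terms let the multiplier respond within a single epoch to a cost spike, whereas the integral-only step needs several epochs to accumulate the same penalty.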

You must design:

A multiplier update rule in _update().
An advantage combination formula in _compute_adv_surrogate() that blends the reward advantage adv_r and cost advantage adv_c using the current multiplier (e.g. (adv_r - lam * adv_c) / (1 + lam) in the standard Lagrangian baseline).
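The standard Lagrangian mixing quoted above can be sketched in a few lines. This is an illustration using NumPy arrays rather than the tensors the real `_compute_adv_surrogate()` receives, and `combine_advantages` is a hypothetical helper name.

```python
import numpy as np


def combine_advantages(adv_r: np.ndarray, adv_c: np.ndarray, lam: float) -> np.ndarray:
    """Blend reward and cost advantages as (adv_r - lam * adv_c) / (1 + lam).

    Dividing by (1 + lam) keeps the scale of the combined advantage bounded
    as the multiplier grows.
    """
    return (adv_r - lam * adv_c) / (1.0 + lam)


adv_r = np.array([1.0, 2.0])
adv_c = np.array([0.5, -1.0])
print(combine_advantages(adv_r, adv_c, lam=0.0))  # lam = 0 leaves adv_r unchanged
print(combine_advantages(adv_r, adv_c, lam=1.0))  # equal weight on reward and cost
```

With `lam = 0` the policy gradient is that of unconstrained PPO; as `lam` grows, the combined advantage approaches `-adv_c`, so the update is dominated by cost reduction.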

The PPO rollout loop, value functions, optimizer, environment interface, and registration plumbing are fixed.

Evaluation

Evaluated on Safety-Gymnasium navigation environments including:

SafetyPointGoal1-v0 — point robot navigating to goals while avoiding hazards.
SafetyCarGoal1-v0 — non-holonomic car robot with the same goal structure.
SafetyPointButton1-v0 — point robot pressing goal buttons while avoiding hazards.

The agent is trained on each environment for the benchmark's fixed step budget. Metrics:

Episode return (reward) — higher is better.
Episode cost (cost) — lower is better, with a target threshold of 25.0 per the Safety-Gymnasium convention used in omnisafe.

High return counts in a method's favor only when its cost is kept under control in all environments.
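One way to read this criterion is as a per-environment feasibility check applied before returns are compared. The sketch below assumes a hypothetical `summarize` helper and a results dictionary mapping each environment to `(episode_return, episode_cost)`. The numbers are made-up placeholders, not benchmark results, and the benchmark's actual scoring rule may differ.

```python
COST_LIMIT = 25.0  # target threshold stated above


def summarize(results: dict[str, tuple[float, float]], cost_limit: float = COST_LIMIT) -> dict:
    """Report per-environment feasibility and the mean return."""
    per_env = {env: cost <= cost_limit for env, (_, cost) in results.items()}
    mean_return = sum(ret for ret, _ in results.values()) / len(results)
    return {
        "per_env_feasible": per_env,
        "feasible_everywhere": all(per_env.values()),
        "mean_return": mean_return,
    }


# Placeholder values for illustration only.
example = {
    "SafetyPointGoal1-v0": (10.0, 20.0),
    "SafetyCarGoal1-v0": (12.0, 30.0),
    "SafetyPointButton1-v0": (5.0, 24.0),
}
print(summarize(example))
```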

Code

custom_lag.py


"""Custom Lagrangian-based safe PPO for MLS-Bench.

EDITABLE section: imports + constraint handling methods.
FIXED sections: algorithm registration, learn() with metrics reporting.
"""

from __future__ import annotations

import time

import numpy as np
import torch

from omnisafe.algorithms import registry
from omnisafe.algorithms.on_policy.base.ppo import PPO

lagrange.py


# Copyright 2023 OmniSafe Team. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of Lagrange."""

pid_lagrange.py


# Copyright 2023 OmniSafe Team. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of PID Lagrange."""

ppo.py


# Copyright 2023 OmniSafe Team. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of the PPO algorithm."""

Method Summary

Auto-summarized from each method's code by an LLM reviewer; this is not the model's original output. The Code section is independent of this summary.


Claude Opus 4.6 · Pseudocode · high

Adaptive PID-Lag with asymmetric advantage

PID Lagrangian targeting 80% of the cost limit, with violation-quadratic gain scheduling, anti-windup integral, predictive lookahead, and 1.5x cost penalty on cost-increasing actions.

Per epoch:
1. $\delta = J_c - 0.8\,c_{\lim}$; $v = \max(0, \delta/c_{\lim})$
2. Gain scale: $\sigma = 1 + 3v + 2v^2$; $k_p = 0.15\sigma$, $k_i = 0.02\sigma$, $k_d = 0.015$
3. EMA-smoothed error: $\bar\delta \leftarrow 0.85\,\bar\delta + 0.15\,\delta$
4. Integral $I$: if $\delta > 0$ add $k_i\delta$, else add $0.33\,k_i\delta$; clip to $[0, 10]$
5. EMA cost: $\bar J \leftarrow 0.85\,\bar J + 0.15\,J_c$; $d = \max(0, \bar J - \bar J_{-10})$
6. Predictive trend: $p = 0.03\max(0, \bar J - \bar J_{-5})$
7. $\lambda = \max(0, k_p\bar\delta + I + k_d d + p)$
8. $A = (A_r - \lambda(1 + 0.5\,\mathbb{1}[A_c > 0])\,A_c)/(1 + \lambda)$
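The eight steps above could be sketched in Python roughly as follows. This is a reading of the summarized pseudocode, not the agent's actual code: the class name `AdaptivePIDLag`, the zero-initialised EMA history, and the use of NumPy arrays for the advantages are assumptions made for illustration.

```python
from collections import deque

import numpy as np


class AdaptivePIDLag:
    """Hypothetical sketch of the per-epoch multiplier update (steps 1-7)."""

    def __init__(self, cost_limit: float = 25.0) -> None:
        self.cost_limit = cost_limit
        self.target = 0.8 * cost_limit          # 80% safety margin
        self.delta_ema = 0.0
        self.cost_ema = 0.0
        self.integral = 0.0
        # EMA cost from the previous 10 epochs (assumed to start at zero).
        self.cost_hist = deque([0.0] * 10, maxlen=10)
        self.lam = 0.0

    def update(self, ep_cost: float) -> float:
        delta = ep_cost - self.target                             # step 1
        v = max(0.0, delta / self.cost_limit)
        sigma = 1.0 + 3.0 * v + 2.0 * v * v                       # step 2
        kp, ki, kd = 0.15 * sigma, 0.02 * sigma, 0.015
        self.delta_ema = 0.85 * self.delta_ema + 0.15 * delta     # step 3
        # Step 4: asymmetric integral (fast up, slow down), clipped to [0, 10].
        step = ki * delta if delta > 0 else 0.33 * ki * delta
        self.integral = min(10.0, max(0.0, self.integral + step))
        self.cost_ema = 0.85 * self.cost_ema + 0.15 * ep_cost     # step 5
        d = max(0.0, self.cost_ema - self.cost_hist[0])           # 10 epochs back
        p = 0.03 * max(0.0, self.cost_ema - self.cost_hist[-5])   # step 6, 5 epochs back
        self.lam = max(0.0, kp * self.delta_ema + self.integral + kd * d + p)  # step 7
        self.cost_hist.append(self.cost_ema)
        return self.lam


def asymmetric_advantage(adv_r: np.ndarray, adv_c: np.ndarray, lam: float) -> np.ndarray:
    """Step 8: weight cost-increasing samples (adv_c > 0) 1.5x."""
    weight = 1.0 + 0.5 * (adv_c > 0)
    return (adv_r - lam * weight * adv_c) / (1.0 + lam)


ctrl = AdaptivePIDLag(cost_limit=25.0)
for ep_cost in (45.0, 38.0, 27.0, 19.0):
    print(round(ctrl.update(ep_cost), 4))
```

Because the error is measured against 80% of the limit, the multiplier starts rising before the cost reaches the actual threshold.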

Δ vs. baseline: Extends the pid_lag baseline with violation-quadratic gain scheduling targeting 80% of the cost limit, asymmetric integral (fast up, slow down), a predictive trend term, and an asymmetric cost-advantage combination.

kp_base=0.15
ki_base=0.02
kd_base=0.015
safety_margin=0.8 * cost_limit
integral_max=10.0
windup_asymmetry=1.0 / 0.33
ema_alpha=0.85 / 0.15
predict_weight=0.03
asymmetry=0.5
Recovers PID-Lag baseline at violation_ratio=0 and asymmetry=0
