Abstract: Safe reinforcement learning tasks are a challenging domain despite being very common in the real world. The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states. In safety-critical domains, such behaviors could lead to disastrous outcomes. To address this issue, we first describe the problem with a stronger Uniformly Constrained MDP (UCMDP) model where we impose constraints on all reachable states; we then propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic, as a solution to the Lagrangian dual of a UCMDP. We benchmark Objective Suppression in two multi-constraint safety domains, including an autonomous driving domain where any incorrect behavior can lead to disastrous consequences. On the driving domain, we evaluate on open source and proprietary data and evaluate transfer to a real autonomous fleet. Empirically, we demonstrate that our proposed method, when combined with existing safe RL algorithms, can match the task reward achieved by baselines with significantly fewer constraint violations.

What problem does this paper attempt to address?

The paper asks how a reinforcement learning (RL) algorithm can satisfy safety constraints in every reachable state in multi-constraint, safety-critical applications. Specifically:

1. **Limitations of the CMDP model**: The traditional Constrained Markov Decision Process (CMDP) controls risk by bounding the expected return of each constraint function. Because the bound applies only in expectation, the policy can still behave dangerously in infrequently visited states, which can have catastrophic consequences in real-world safety-critical fields such as autonomous driving.
2. **Proposing the UCMDP model**: To address this, the authors propose the Uniformly Constrained MDP (UCMDP). A UCMDP imposes the constraints on all reachable states rather than only on the expected value, so it enforces safety more strictly, especially for long-tail events.
3. **Objective Suppression method**: To solve the UCMDP, the authors propose a new method, Objective Suppression. It adaptively suppresses the task-reward objective and adjusts the direction of policy optimization according to the safety critic's evaluation. Objective Suppression is derived as a solution to the Lagrangian dual of the UCMDP and is combined with existing safe RL algorithms to improve safety in multi-constraint scenarios.
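The contrast between the CMDP and UCMDP constraints described above can be made concrete with a small numerical sketch. All numbers below are hypothetical and chosen only for illustration, and the CMDP check is simplified to a visitation-weighted average of per-pair costs; neither the values nor the helper names come from the paper.

```python
import numpy as np

# Hypothetical numbers, chosen only to illustrate the argument.
# d[k]: visitation probability of state-action pair k under the policy.
# q_cost[k]: expected discounted cost Q^C(s, a) starting from that pair.
d = np.array([0.70, 0.29, 0.01])       # the last pair is a rare, long-tail case
q_cost = np.array([0.00, 0.05, 5.00])
epsilon = 0.10                          # assumed cost threshold


def satisfies_expected_constraint(d, q_cost, epsilon):
    """CMDP-style check (simplified): only the weighted average cost is bounded."""
    return float(np.dot(d, q_cost)) <= epsilon


def satisfies_uniform_constraint(d, q_cost, epsilon):
    """UCMDP-style check: every pair with d > 0 must respect the bound."""
    reachable = d > 0
    return bool(np.all(q_cost[reachable] <= epsilon))


print("expected constraint satisfied:", satisfies_expected_constraint(d, q_cost, epsilon))
print("uniform constraint satisfied: ", satisfies_uniform_constraint(d, q_cost, epsilon))
```

In this toy setting the average cost is 0.0645, so the expectation-based check passes even though the rare pair has a cost of 5.0, which the uniform check rejects.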
### Formula Summary

- **Constraint form of CMDP**: the expected discounted cost of each constraint $C_i$ is

  \[ J_{C_i}(\pi)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t = 0}^{\infty}\gamma^t C_i(s_t,a_t,s_{t + 1})\right] \]

- **Optimization objective of UCMDP**:

  \[ \theta^*=\arg\max_{\theta}J_{\pi}^R\quad\text{s.t.}\quad Q_{\pi}^{C_i}(s,a)\leq\epsilon,\quad\forall i\in\{1,\dots,n\},\quad\forall (s,a)\ \text{with}\ d_{\pi}(s,a)>0 \]

- **Gradient update formula of Objective Suppression**:

  \[ \nabla_{\theta}J_{\pi}^{\text{supp}}=\mathbb{E}_{s,a}\left[\left(\tilde{p}^-(s,a)Q_{\pi}^R(s,a)-\sum_{i = 1}^n w_i\tilde{p}_i(s,a)Q_{\pi}^{C_i}(s,a)\right)\nabla_{\theta}\log\pi(s,a)\right] \]

  where:

  \[ \tilde{p}^-(s,a)=\exp\left(-\kappa\sum_{i}Q_{\pi}^{C_i}(s,a)\right) \]

### Experimental Results

The authors tested Objective Suppression in two safety-critical environments with multiple constraints:

- **Safe Mujoco-Ant**: The number of collisions was reduced by 33%, while only 5% of the task reward was lost.
- **Safe Bench**: Collisions were reduced by 32.5% and lane departures by 39.1%, with only a small decrease in the task reward.

The authors also ran experiments on a real-world driving dataset. The results were consistent with the simulation results: the method significantly increased the driving distance and reduced collisions and sudden braking events.
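Returning to the Objective Suppression gradient in the formula summary, the scalar that multiplies the score function can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, costs are assumed non-negative, and the per-constraint weights `p_cost` (the p_i terms) are passed in as an input because this summary does not define them.

```python
import numpy as np


def suppression_weight(q_costs, kappa):
    """p^-(s, a) = exp(-kappa * sum_i Q^{C_i}(s, a)).

    q_costs has shape (batch, n_constraints). With non-negative costs the
    weight lies in (0, 1] and shrinks as the predicted cost grows.
    """
    return np.exp(-kappa * q_costs.sum(axis=1))


def suppressed_coefficient(q_reward, q_costs, p_cost, w, kappa):
    """Per-sample scalar multiplying grad log pi in the summarized gradient.

    p_cost stands in for the per-constraint weights p_i(s, a); how they are
    computed is not given in this summary, so they are an assumed input.
    """
    p_minus = suppression_weight(q_costs, kappa)
    penalty = (w * p_cost * q_costs).sum(axis=1)
    return p_minus * q_reward - penalty


# Hypothetical batch: one safe pair and one pair the safety critic flags.
q_reward = np.array([1.0, 1.0])
q_costs = np.array([[0.0, 0.0],
                    [2.0, 1.0]])
p_cost = np.ones_like(q_costs)   # placeholder weights (assumption)
w = np.array([0.5, 0.5])         # assumed constraint weights w_i

coef = suppressed_coefficient(q_reward, q_costs, p_cost, w, kappa=1.0)
print(coef)
```

For the safe pair the reward term passes through unchanged, while for the flagged pair the reward term is scaled by exp(-3) and the cost penalty dominates, so the update pushes probability away from that action.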

Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications

Learning Observation-Based Certifiable Safe Policy for Decentralized Multi-Robot Navigation

Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards

Long and Short-Term Constraints Driven Safe Reinforcement Learning for Autonomous Driving

Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings

Progressive Adaptive Chance-Constrained Safeguards for Reinforcement Learning

Safe Reinforcement Learning with Dual Robustness

SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization

Evaluating Model-free Reinforcement Learning Toward Safety-critical Tasks

Safe Multi-Agent Reinforcement Learning with Bilevel Optimization in Autonomous Driving

Look Before You Leap: Safe Model-Based Reinforcement Learning with Human Intervention

State-wise Safe Reinforcement Learning: A Survey

GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model

Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate

Constrained Update Projection Approach to Safe Policy Optimization

Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium

Learning Adaptive Safety for Multi-Agent Systems

Safe CoR: A Dual-Expert Approach to Integrating Imitation Learning and Safe Reinforcement Learning Using Constraint Rewards

Enforcing Cooperative Safety for Reinforcement Learning-based Mixed-Autonomy Platoon Control

Multi-Agent Constrained Policy Optimisation

CVaR-Constrained Policy Optimization for Safe Reinforcement Learning