Zihan Zhou, Jonathan Booher, Khashayar Rohanimanesh, Wei Liu, Aleksandr Petiushko, Animesh Garg
Abstract: Safe reinforcement learning tasks are a challenging domain despite being very common in the real world. The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states. In safety-critical domains, such behaviors could lead to disastrous outcomes. To address this issue, we first describe the problem with a stronger Uniformly Constrained MDP (UCMDP) model where we impose constraints on all reachable states; we then propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic, as a solution to the Lagrangian dual of a UCMDP. We benchmark Objective Suppression in two multi-constraint safety domains, including an autonomous driving domain where any incorrect behavior can lead to disastrous consequences. On the driving domain, we evaluate on open source and proprietary data and evaluate transfer to a real autonomous fleet. Empirically, we demonstrate that our proposed method, when combined with existing safe RL algorithms, can match the task reward achieved by baselines with significantly fewer constraint violations.
What problem does this paper attempt to address?
The paper addresses how to ensure that a reinforcement learning (RL) agent satisfies safety constraints in every reachable state, not just in expectation, in multi-constraint safety-critical applications. Specifically:
1. **Limitations of the CMDP model**: The traditional Constrained Markov Decision Process (CMDP) controls risk by bounding the expected return of each constraint function. Because the bound applies only in expectation, high-risk behavior can persist in infrequently visited states, which can be catastrophic in real-world safety-critical fields such as autonomous driving.
2. **Proposing the UCMDP model**: To address this, the authors propose the Uniformly Constrained MDP (UCMDP), which imposes the constraints on every reachable state rather than only on the expected value. This gives a stricter safety guarantee, especially for long-tail events.
3. **Objective Suppression method**: To solve the UCMDP, the authors propose Objective Suppression, a method that adaptively suppresses the task-reward objective according to a safety critic's evaluation, which shifts the direction of policy optimization. It is derived as a solution to the Lagrangian dual of the UCMDP and is combined with existing safe RL algorithms to improve safety in multi-constraint scenarios.
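The difference between the expectation-level constraint in item 1 and the uniform constraint in item 2 can be sketched with a toy example. The visitation probabilities, cost values, and threshold below are hypothetical numbers, not taken from the paper, and the function names are illustrative. The visitation-weighted average is used here only as a simple stand-in for the CMDP's expected constraint return.

```python
# Toy illustration with hypothetical numbers (not from the paper):
# a policy can pass an expectation-level check while a rarely visited
# state-action pair still carries a high cost.

def satisfies_expected_constraint(visit_prob, cost_q, eps):
    """CMDP-style check: the visitation-weighted average cost is at most eps."""
    expected_cost = sum(p * q for p, q in zip(visit_prob, cost_q))
    return expected_cost <= eps


def satisfies_uniform_constraint(visit_prob, cost_q, eps):
    """UCMDP-style check: every reachable pair (visit_prob > 0) has cost at most eps."""
    return all(q <= eps for p, q in zip(visit_prob, cost_q) if p > 0)


# Four state-action pairs: two common, one long-tail, one unreachable.
visit_prob = [0.70, 0.29, 0.01, 0.0]
cost_q = [0.0, 0.1, 5.0, 9.0]  # assumed cost-critic values per pair
eps = 0.5                      # assumed constraint threshold

print("expected constraint satisfied:",
      satisfies_expected_constraint(visit_prob, cost_q, eps))
print("uniform constraint satisfied:",
      satisfies_uniform_constraint(visit_prob, cost_q, eps))
```

In this toy setting the long-tail pair (visited 1% of the time) barely moves the average, so the expectation-level check passes, while the uniform check fails because that pair alone exceeds the threshold. The unreachable pair is ignored by both checks.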
### Formula Summary
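- **Constraint form of CMDP** (each constraint C_i is imposed on its expected discounted return, which the CMDP bounds by a threshold):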
\[
C_i: J_{C_i}(\pi)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t = 0}^{\infty}\gamma^tC_i(s_t,a_t,s_{t + 1})\right]
\]
- **Optimization objective of UCMDP**:
\[
\theta^*=\arg\max_{\theta}J_{\pi}^R\quad\text{s.t.}\quad Q_{\pi}^{C_i}(s,a)\leq\epsilon,\quad\forall i\in\{1,\dots,n\},\quad\forall (s,a)\ \text{with}\ d_{\pi}(s,a)>0
\]
- **Gradient update formula of Objective Suppression**:
\[
\nabla_{\theta}J_{\pi}^{\text{supp}}=\mathbb{E}_{s,a}\left[\left(\tilde{p}^-(s,a)Q_{\pi}^R(s,a)-\sum_{i = 1}^nw_i\tilde{p}_i(s,a)Q_{\pi}^{C_i}(s,a)\right)\nabla_{\theta}\log\pi(s,a)\right]
\]
where:
\[
\tilde{p}^-(s,a)=\exp\left(-\kappa\sum_{i}Q_{\pi}^{C_i}(s,a)\right)
\]
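As a rough sketch of how the bracketed term in the gradient formula above behaves, the following computes the suppression coefficient and the per-sample weight that multiplies the log-probability gradient. The function names are hypothetical. Because this summary does not define the per-constraint factors \tilde{p}_i or the weights w_i, they are taken as caller-supplied inputs rather than derived, and the cost-critic values are assumed to be non-negative so that the coefficient lies in (0, 1].

```python
import math


def suppression_coefficient(q_costs, kappa):
    """Reward-suppression factor exp(-kappa * sum_i Q^{C_i}(s, a)).

    Assumes non-negative cost-critic values, so the result is in (0, 1].
    """
    return math.exp(-kappa * sum(q_costs))


def suppressed_signal(q_reward, q_costs, p_tilde, weights, kappa):
    """Per-sample weight on grad log pi in the gradient formula.

    p_tilde and weights are supplied by the caller because their
    definitions are not given in this summary (an assumption of this sketch).
    """
    p_minus = suppression_coefficient(q_costs, kappa)
    penalty = sum(w * p * q for w, p, q in zip(weights, p_tilde, q_costs))
    return p_minus * q_reward - penalty


# Hypothetical values for one safe and one risky state-action pair.
safe = suppressed_signal(2.0, [0.0, 0.0], [0.3, 0.3], [1.0, 1.0], kappa=1.0)
risky = suppressed_signal(2.0, [1.0, 2.0], [0.5, 0.5], [1.0, 1.0], kappa=1.0)
print("safe pair weight:", safe)
print("risky pair weight:", risky)
```

When the safety critic predicts no cost, the coefficient is 1 and the task-reward term passes through unchanged. As predicted costs grow, the reward term is exponentially down-weighted and the cost penalty dominates the update.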
### Experimental Results
The authors tested Objective Suppression in two safety-critical environments with multiple constraints:
- **Safe Mujoco-Ant**: The number of collisions was reduced by 33%, while only 5% of the task reward was lost.
- **Safe Bench**: Collisions were reduced by 32.5% and lane departures by 39.1%, with only a small decrease in task reward.
In addition, the authors ran experiments on a real driving dataset. The results were consistent with those in simulation: the method significantly increased driving distance and reduced collisions and sudden-braking events.