State-wise Constrained Policy Optimization

Weiye Zhao, Rui Chen, Yifan Sun, Tianhao Wei, Changliu Liu

2024-06-18

Abstract: Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the framework of Constrained Markov Decision Process (CMDP) do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. In particular, we introduce the framework of Maximum Markov Decision Process, and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.

Machine Learning, Robotics

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the safety issues that Reinforcement Learning (RL) algorithms face in real-world applications, especially in tasks where constraints must hold at every time step, such as autonomous driving and robot manipulation. Existing safe RL algorithms mainly enforce cumulative or probabilistic constraints and do not consider state-wise constraints, which apply at each individual time step. The paper therefore proposes **State-wise Constrained Policy Optimization (SCPO)**, which the authors describe as the first general-purpose policy search algorithm for state-wise constrained reinforcement learning.

#### Main Contributions

1. **Theoretical Guarantees**: SCPO provides guarantees for state-wise constraint satisfaction in expectation. The paper introduces the Maximum Markov Decision Process (MMDP) framework and proves that the worst-case safety violation is bounded under SCPO.
2. **Experimental Validation**: The effectiveness of SCPO is validated by training neural network policies on a wide range of robot locomotion tasks. The results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.
3. **Innovations**: SCPO learns end-to-end policies without explicitly maintaining safety monitors, thereby avoiding the safety violation issues present in existing methods.

Through these contributions, SCPO takes an important step towards practical safe reinforcement learning algorithms that apply to many real-world problems.
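To make the difference between a cumulative constraint and a state-wise constraint concrete, here is a minimal Python sketch. It is an illustration only, not the SCPO algorithm or the paper's MMDP definition: the function names, the per-step cost values, the budget, and the limit are all hypothetical. It shows that a trajectory can stay within a cumulative cost budget while still violating a per-step limit.

```python
from typing import Sequence


def cumulative_cost(costs: Sequence[float], gamma: float = 1.0) -> float:
    """Discounted sum of per-step costs (the quantity a cumulative constraint bounds)."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))


def max_statewise_cost(costs: Sequence[float]) -> float:
    """Largest single-step cost along the trajectory (worst-case per-step violation)."""
    return max(costs)


def satisfies_cumulative(costs: Sequence[float], budget: float, gamma: float = 1.0) -> bool:
    """True if the total (discounted) cost stays within the budget."""
    return cumulative_cost(costs, gamma) <= budget


def satisfies_statewise(costs: Sequence[float], limit: float) -> bool:
    """True only if every individual step stays within the limit."""
    return all(c <= limit for c in costs)


# Hypothetical per-step safety costs for one trajectory (assumed values).
trajectory_costs = [0.0, 0.2, 0.9, 0.1, 0.0]

print(satisfies_cumulative(trajectory_costs, budget=2.0))  # total cost is within budget
print(satisfies_statewise(trajectory_costs, limit=0.5))    # one step exceeds the limit
```

In this sketch the third step alone exceeds the assumed per-step limit, so the state-wise check fails even though the cumulative check passes. This is the kind of gap the summary attributes to cumulative-constraint methods.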

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

Absolute State-wise Constrained Policy Optimization: High-Probability State-wise Constraints Satisfaction

CCPO: Conservatively Constrained Policy Optimization Using State Augmentation

Optimal Control for Constrained Discrete-Time Nonlinear Systems Based on Safe Reinforcement Learning

SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization

Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization

Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization

Safe Multiagent Learning with Soft Constrained Policy Optimization in Real Robot Control

Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning

ROSCOM: Robust Safe Reinforcement Learning on Stochastic Constraint Manifolds

State-wise Safe Reinforcement Learning: A Survey

Almost Surely Safe Exploration and Exploitation for Deep Reinforcement Learning with State Safety Estimation

CVaR-Constrained Policy Optimization for Safe Reinforcement Learning

Constrained Reinforcement Learning Under Model Mismatch

Constrained Variational Policy Optimization for Safe Reinforcement Learning

Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

Shielded Planning Guided Data-Efficient and Safe Reinforcement Learning

Probabilistic Constraint for Safety-Critical Reinforcement Learning

Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets