State-wise Constrained Policy Optimization

Weiye Zhao, Rui Chen, Yifan Sun, Tianhao Wei, Changliu Liu
2024-06-18
Abstract: Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the framework of Constrained Markov Decision Process (CMDP) do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. In particular, we introduce the framework of Maximum Markov Decision Process, and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.
Machine Learning, Robotics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the safety challenges that Reinforcement Learning (RL) algorithms face in real-world applications, especially in tasks that require state-wise constraints, that is, constraints that must hold at every time step (such as autonomous driving and robot manipulation). Existing safe RL algorithms mainly enforce cumulative or probabilistic constraints and do not consider constraints on the state at each time step. The paper therefore proposes **State-wise Constrained Policy Optimization (SCPO)**, which it presents as the first general-purpose policy search algorithm for state-wise constrained reinforcement learning.

#### Main Contributions

1. **Theoretical Guarantees**: SCPO provides guarantees for state-wise constraint satisfaction in expectation. The paper introduces the Maximum Markov Decision Process (MMDP) framework and proves that the worst-case safety violation is bounded under SCPO.
2. **Experimental Validation**: SCPO is evaluated by training neural network policies on a wide range of robot locomotion tasks in which the agent must satisfy a variety of state-wise safety constraints. The results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.
3. **Innovations**: SCPO learns end-to-end policies without explicitly maintaining safety monitors, thereby avoiding the safety violation issues present in existing methods.

Through these contributions, SCPO takes an important step towards practical safe reinforcement learning algorithms applicable to many real-world problems.
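To make the distinction between cumulative and state-wise constraints concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes non-negative per-step costs and tracks the worst single-step cost as a sum of non-negative increments over a running maximum, so that the worst case can be bounded in the same way a cumulative cost is. All names (`cumulative_cost`, `worst_case_cost`, `cumulative_budget`, `statewise_limit`) and the cost values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def cumulative_cost(costs: Sequence[float]) -> float:
    """Total cost over a trajectory, the quantity a cumulative constraint bounds."""
    return sum(costs)


def worst_case_cost(costs: Sequence[float]) -> float:
    """Largest single-step cost, accumulated as non-negative increments.

    Assumption for this sketch: costs are non-negative and the running
    maximum starts at 0, so the increments telescope to max(costs).
    """
    running_max = 0.0
    total_increment = 0.0
    for c in costs:
        # Only the amount by which this step exceeds the previous maximum counts.
        increment = max(c - running_max, 0.0)
        total_increment += increment
        running_max += increment
    return total_increment


# Hypothetical per-step costs along one trajectory.
costs = [0.0, 1.0, 3.0, 0.0, 2.0]
cumulative_budget = 10.0  # illustrative budget on the summed cost
statewise_limit = 2.0     # illustrative limit on every individual step

satisfies_cumulative = cumulative_cost(costs) <= cumulative_budget
satisfies_statewise = worst_case_cost(costs) <= statewise_limit
print(satisfies_cumulative, satisfies_statewise)
```

In this example the trajectory stays within the cumulative budget yet exceeds the per-step limit at the third step, which is the kind of violation a purely cumulative constraint cannot rule out.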