Shielded Planning Guided Data-Efficient and Safe Reinforcement Learning
Hao Wang,Jiahu Qin,Zhen Kan
DOI: https://doi.org/10.1109/tnnls.2024.3359031
IF: 14.255
2024-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract:Safe reinforcement learning (RL) has shown great potential for building safe general-purpose robotic systems. While many existing works have focused on post-training policy safety, it remains an open problem to ensure safety during training as well as to improve exploration efficiency. Motivated to address these challenges, this work develops shielded planning guided policy optimization (SPPO), a new model-based safe RL method that augments policy optimization algorithms with path planning and shielding mechanism. In particular, SPPO is equipped with shielded planning for guided exploration and efficient data collection via model predictive path integral (MPPI), along with an advantage-based shielding rule to keep the above processes safe. Based on the collected safe data, a task-oriented parameter optimization (TOPO) method is used for policy improvement, as well as the observation-independent latent dynamics enhancement. In addition, SPPO provides explicit theoretical guarantees, i.e., clear theoretical bounds for training safety, deployment safety, and the learned policy performance. Experiments demonstrate that SPPO outperforms baselines in terms of policy performance, learning efficiency, and safety performance during training.
computer science, artificial intelligence, theory & methods,engineering, electrical & electronic, hardware & architecture