CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning

Long Yang, Jiaming Ji, Juntao Dai, Yu Zhang, Pengfei Li, Gang Pan
DOI: https://doi.org/10.48550/arXiv.2202.07565
2022-02-16
Abstract: Safe reinforcement learning (RL) is still very challenging since it requires the agent to consider both return maximization and safe exploration. In this paper, we propose CUP, a Conservative Update Policy algorithm with a theoretical safety guarantee. We derive CUP from newly proposed performance bounds and surrogate functions. Although using bounds as surrogate functions to design safe RL algorithms has appeared in some existing works, we advance them in at least three aspects: (i) We provide a rigorous theoretical analysis to extend the surrogate functions to the generalized advantage estimator (GAE). GAE significantly reduces variance empirically while maintaining a tolerable level of bias, which is an efficient step for us to design CUP; (ii) The proposed bounds are tighter than those in existing works, i.e., using the proposed bounds as surrogate functions yields better local approximations to the objective and safety constraints; (iii) The proposed CUP provides a non-convex implementation via first-order optimizers, which does not depend on any convex approximation. Finally, extensive experiments show the effectiveness of CUP, where the agent satisfies safety constraints. We have opened the source code of CUP at https://github.com/RL-boxes/Safe-RL.
Machine Learning, Artificial Intelligence
### What problem does this paper attempt to address?

The paper "CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning" addresses safety in reinforcement learning (RL). Specifically, it proposes a Conservative Update Policy (CUP) algorithm that comes with theoretical safety guarantees and keeps exploration safe while maximizing reward.

### Background and challenges

In real-world applications, maximizing reward alone is not enough; the safety of actions must also be considered. For example, a robot agent should avoid actions that may permanently damage its hardware, and a recommendation system should avoid showing unpleasant items to users. Safe exploration is therefore crucial for reinforcement learning, and it is usually modeled as a Constrained Markov Decision Process (CMDP).

Traditional reinforcement learning methods (such as Q-learning and policy gradient methods) often violate safe exploration constraints, which is a major challenge in safe reinforcement learning. Some existing methods use surrogate functions to approximate the objective and constraints, but they often rely on convex approximations, which introduce several sources of error and incur high computational cost.

### Main contributions of the paper

1. **New performance bounds and surrogate functions**:
   - The paper proposes new performance bounds and surrogate functions with a rigorous theoretical analysis, and extends them to the Generalized Advantage Estimator (GAE). GAE significantly reduces variance while keeping bias at an acceptable level.
   - The new bounds are tighter than those in existing works, so they give better local approximations of the objective and safety constraints when used as surrogate functions.
2. **Non-convex implementation**:
   - CUP provides a non-convex implementation that does not rely on any convex approximation and can adapt to high-dimensional safe reinforcement learning problems.
3. **Theoretical guarantees**:
   - The paper gives theoretical safety guarantees for CUP, including a lower bound on policy improvement and an upper bound on constraint violation.
4. **Experimental verification**:
   - Extensive experiments show that CUP can effectively improve policy performance while satisfying safety constraints.

### Experimental results

The experimental results show that CUP performs well on multiple continuous-control tasks: it quickly stabilizes the constrained return and converges more quickly to higher reward. Specifically, in the Ant-v3 and Hopper-v3 environments, CUP outperforms other existing safe reinforcement learning algorithms (such as CPO, TRPO-L, PPO-L, and FOCOPS).

### Conclusion

By proposing the CUP algorithm, this paper addresses key challenges in safe reinforcement learning and provides an effective and safe solution. CUP has rigorous theoretical guarantees and also performs well in the reported experiments.
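
### Illustration: computing GAE

The summary above leans on the Generalized Advantage Estimator (GAE) without showing what it computes. The following is a minimal sketch of the standard GAE recursion for a single trajectory. It is not taken from the CUP source code. The function name `gae_advantages`, the parameter names, and the default values of `gamma` and `lam` are assumptions chosen for illustration.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,  # discount factor (illustrative default)
    lam: float = 0.95,    # GAE lambda: trades bias against variance (illustrative default)
) -> List[float]:
    """Return generalized advantage estimates for one trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final step
    (use 0.0 if that state is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam = 0` the estimate reduces to the one-step TD error (low variance, more bias), and with `lam = 1` it becomes the discounted return minus the value baseline (less bias, higher variance). In a constrained setting the same recursion could presumably be applied to a cost signal in place of the reward. That usage is an assumption here, not something the summary spells out.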