Abstract: Safe reinforcement learning (RL) remains challenging because the agent must consider both return maximization and safe exploration. In this paper, we propose CUP, a Conservative Update Policy algorithm with a theoretical safety guarantee. We derive CUP from newly proposed performance bounds and surrogate functions. Although using bounds as surrogate functions to design safe RL algorithms has appeared in some existing works, we develop them in at least three aspects: (i) We provide a rigorous theoretical analysis that extends the surrogate functions to the generalized advantage estimator (GAE). GAE significantly reduces variance empirically while maintaining a tolerable level of bias, which is an efficient step toward designing CUP; (ii) The proposed bounds are tighter than those in existing works, i.e., using the proposed bounds as surrogate functions gives better local approximations to the objective and safety constraints; (iii) The proposed CUP provides a non-convex implementation via first-order optimizers, which does not depend on any convex approximation. Finally, extensive experiments show the effectiveness of CUP, with the agent satisfying the safety constraints. The source code of CUP is available at https://github.com/RL-boxes/Safe-RL.
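The abstract leans on the generalized advantage estimator, so a small illustration of how GAE is usually computed may help. The following is a minimal sketch of the standard GAE recursion for a single trajectory, not the CUP authors' implementation; the function name `gae_advantages` and the default values of `gamma` and `lam` are assumptions chosen for illustration.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,  # discount factor (illustrative default)
    lam: float = 0.95,    # GAE bias/variance trade-off (illustrative default)
) -> List[float]:
    """Compute GAE(gamma, lam) advantages for one trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final step
    (use 0.0 if that state is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later residuals.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces this to the one-step TD residual (lower variance, more bias), while `lam=1` gives the discounted Monte Carlo return minus the value baseline (less bias, higher variance). In a constrained setting, the same routine could plausibly be applied to a cost signal with a separate cost value function, though that pairing is an assumption here.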

What problem does this paper attempt to address?

The paper "CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning" addresses the safety problem in reinforcement learning (RL). Specifically, it proposes a Conservative Update Policy (CUP) algorithm that comes with theoretical safety guarantees and keeps exploration safe while maximizing rewards.

### Background and challenges

In real-world applications, maximizing rewards is not enough; the safety of actions must also be considered. For example, a robot agent should avoid actions that may permanently damage its hardware, and a recommendation system should avoid showing unpleasant items to users. Safe exploration is therefore crucial for reinforcement learning, and it is usually modeled as a Constrained Markov Decision Process (CMDP).

Traditional reinforcement learning methods (such as Q-learning and policy gradient methods) often violate safe exploration constraints, which is a major challenge in safe reinforcement learning. Some existing methods use surrogate functions to approximate the objective and constraints, but they often rely on convex approximations, which introduce several sources of error and incur high computational cost.

### Main contributions of the paper

1. **New performance bounds and surrogate functions**:
   - New performance bounds and surrogate functions are proposed with a rigorous theoretical analysis that extends them to the Generalized Advantage Estimator (GAE). GAE significantly reduces variance while maintaining an acceptable level of bias.
   - The new bounds are tighter than those in existing works and, when used as surrogate functions, give better local approximations of the objective and safety constraints.
2. **Non-convex implementation**:
   - The proposed CUP algorithm provides a non-convex implementation via first-order optimizers that does not rely on any convex approximation and can adapt to high-dimensional safe reinforcement learning problems.
3. **Theoretical guarantees**:
   - Theoretical safety guarantees for CUP are provided, including a lower bound on policy improvement and an upper bound on constraint violation.
4. **Experimental verification**:
   - Extensive experiments show that CUP can effectively improve policy performance while satisfying safety constraints.

### Experimental results

The experimental results show that CUP performs well on multiple continuous-control tasks: it quickly stabilizes the constrained rewards and converges faster to higher target rewards. Specifically, in the Ant-v3 and Hopper-v3 environments, CUP outperforms other existing safe reinforcement learning algorithms (such as CPO, TRPO-L, PPO-L, and FOCOPS).

### Conclusion

By proposing the CUP algorithm, this paper addresses key challenges in safe reinforcement learning and provides an effective and safe solution. CUP has rigorous theoretical guarantees and also performs well in the reported experiments.
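For readers unfamiliar with the CMDP formulation mentioned in the background above, the constrained problem is commonly written as follows. This is the standard textbook form rather than notation taken from the paper; the symbols $J$, $C$, $r$, $c$, $b$, and $\gamma$ are assumptions introduced here for illustration.

```latex
\max_{\pi} \; J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right]
\quad \text{subject to} \quad
C(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, c(s_t, a_t) \right] \le b
```

Here $r$ is the reward function, $c$ is a cost function that penalizes unsafe behavior, $\gamma \in (0, 1)$ is the discount factor, and $b$ is the cost budget. Surrogate-based methods replace $J$ and $C$ with local approximations around the current policy.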

CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning
