Abstract: Reinforcement learning (RL) has achieved promising results on most robotic control tasks. The safety of learning-based controllers is essential to ensuring their effectiveness. Current methods enforce the full constraints consistently throughout training, which results in inefficient exploration in the early stage. In this paper, we propose an algorithm named Constrained Policy Optimization with Extra Safety Budget (ESB-CPO) to strike a balance between exploration efficiency and constraint satisfaction. In the early stage, our method loosens the practical constraints on unsafe transitions (adding an extra safety budget) with the aid of a new metric we propose. As training proceeds, the constraints in our optimization problem become tighter. Meanwhile, theoretical analysis and practical experiments demonstrate that our method gradually meets the cost limit in the final training stage. When evaluated on the Safety-Gym and Bullet-Safety-Gym benchmarks, our method shows advantages over baseline algorithms in terms of safety and optimality. Notably, our method achieves a substantial performance improvement over the baselines under the same cost limit.

What problem does this paper attempt to address?

This paper addresses how to balance exploration efficiency and constraint satisfaction in reinforcement learning (RL). Existing methods enforce the full constraints consistently throughout training (i.e., they strictly adhere to all safety constraints from the start), which leads to low exploration efficiency in the early stages. Such strict constraints limit the exploration space of the policy and may cause the policy to settle on sub-optimal solutions.

To solve this problem, the authors propose a new algorithm, Constrained Policy Optimization with Extra Safety Budget (ESB-CPO). The algorithm relaxes the constraints on unsafe transitions by introducing an extra safety budget, thereby encouraging more efficient exploration in the early stages of training. As training progresses, the extra safety budget gradually decreases, ultimately enabling the policy to meet the original constraints. The method thus improves exploration efficiency while still ensuring the safety of the final policy.

The main contributions of the paper include:

- Proposing a new metric, Lyapunov-based Advantage Estimation (LAE), for evaluating safe and unsafe transitions. LAE consists of two parts, a stability value and a safety value, where the safety value has a significant impact only on unsafe transitions.
- Based on LAE, designing the ESB-CPO algorithm, which encourages exploration by adding an extra safety budget in the early stages of training and gradually tightens this budget in the later stages so that the safety constraints are finally met.
- Updating the two key parameters (\(\alpha\) and \(\beta\)) in LAE through an adaptive method to dynamically adjust the evaluation criteria for safe and unsafe transitions.

Experimental results show that ESB-CPO outperforms the baseline algorithms on multiple benchmarks and obtains higher rewards while satisfying the safety constraints. These results support the effectiveness of ESB-CPO in improving exploration efficiency and meeting the constraints.
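The idea of a safety budget that is loose early and tight late can be illustrated with a minimal sketch. This is not the ESB-CPO update: the paper derives its extra budget from LAE with adaptively updated \(\alpha\) and \(\beta\), whereas the sketch below assumes a fixed linear decay schedule, and every name in it (`extra_safety_budget`, `effective_cost_limit`, `initial_extra`) is hypothetical.

```python
def extra_safety_budget(step, total_steps, initial_extra):
    """Hypothetical schedule: decay the extra budget linearly to zero.

    Assumption for illustration only; the paper adapts its budget through
    LAE rather than through a fixed schedule like this one.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of training still remaining, clipped at zero past the end.
    fraction_left = max(0.0, 1.0 - step / total_steps)
    return initial_extra * fraction_left


def effective_cost_limit(step, total_steps, cost_limit, initial_extra):
    """Cost limit actually enforced at a given step: original limit plus extra budget."""
    return cost_limit + extra_safety_budget(step, total_steps, initial_extra)


if __name__ == "__main__":
    # The enforced limit starts loose and shrinks to the original cost limit.
    for step in (0, 50, 100):
        limit = effective_cost_limit(step, 100, cost_limit=25.0, initial_extra=10.0)
        print(step, limit)
```

With these illustrative numbers, the enforced limit starts above the original cost limit and coincides with it once training ends, which mirrors the loose-then-tight behaviour the summary describes.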

Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization

Safe Sim-to-Real Robot Exploration with Constrained Bayesian Optimization

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

Learning Observation-Based Certifiable Safe Policy for Decentralized Multi-Robot Navigation

Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets

State-wise Constrained Policy Optimization

Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization

Absolute State-wise Constrained Policy Optimization: High-Probability State-wise Constraints Satisfaction

Train Trajectory Optimization with High-Risk State Space Boundaries: A Safe Reinforcement Learning Approach

Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics Via Dual-agent Reinforcement Learning

Constrained Update Projection Approach to Safe Policy Optimization

Shielded Planning Guided Data-Efficient and Safe Reinforcement Learning

FOSP: Fine-tuning Offline Safe Policy through World Models

Enhancing Efficiency of Safe Reinforcement Learning via Sample Manipulation

SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization

CVaR-Constrained Policy Optimization for Safe Reinforcement Learning

Safe Policy Optimization with Local Generalized Linear Function Approximations

Safe Driving Via Expert Guided Policy Optimization

Augmented Proximal Policy Optimization for Safe Reinforcement Learning

Safe Policy Exploration Improvement via Subgoals

Iterative Reachability Estimation for Safe Reinforcement Learning