Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization

Haotian Xu, Shengjie Wang, Zhaolei Wang, Yunzhe Zhang, Qing Zhuo, Yang Gao, Tao Zhang
2023-07-28
Abstract: Reinforcement learning (RL) has achieved promising results on most robotic control tasks. Safety of learning-based controllers is an essential notion of ensuring the effectiveness of the controllers. Current methods adopt whole consistency constraints during the training, thus resulting in inefficient exploration in the early stage. In this paper, we propose an algorithm named Constrained Policy Optimization with Extra Safety Budget (ESB-CPO) to strike a balance between the exploration efficiency and the constraints satisfaction. In the early stage, our method loosens the practical constraints of unsafe transitions (adding extra safety budget) with the aid of a new metric we propose. With the training process, the constraints in our optimization problem become tighter. Meanwhile, theoretical analysis and practical experiments demonstrate that our method gradually meets the cost limit's demand in the final training stage. When evaluated on Safety-Gym and Bullet-Safety-Gym benchmarks, our method has shown its advantages over baseline algorithms in terms of safety and optimality. Remarkably, our method gains remarkable performance improvement under the same cost limit compared with baselines.
Robotics, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to balance exploration efficiency and constraint satisfaction in reinforcement learning (RL). Existing methods enforce the same safety constraints throughout the whole training process, which makes exploration inefficient in the early stages: the strict constraints shrink the region the policy can explore and may trap it in sub-optimal solutions. To address this, the authors propose Constrained Policy Optimization with Extra Safety Budget (ESB-CPO). The algorithm relaxes the constraints on unsafe transitions by adding an extra safety budget, which encourages more efficient exploration early in training. As training progresses, the extra budget gradually shrinks, so that the final policy meets the original cost limit. The method therefore aims to improve exploration efficiency while still ensuring the safety of the final policy.
The main contributions of the paper are:
- A new metric, Lyapunov-based Advantage Estimation (LAE), for evaluating safe and unsafe transitions. LAE consists of two parts, a stability value and a safety value; the safety value has a significant effect only on unsafe transitions.
- The ESB-CPO algorithm, built on LAE, which encourages exploration by adding an extra safety budget in the early stages of training and gradually tightens that budget in the later stages so that the safety constraints are eventually met.
- An adaptive update of the two key LAE parameters (\(\alpha\) and \(\beta\)), which dynamically adjusts how safe and unsafe transitions are evaluated.
Experimental results on the Safety-Gym and Bullet-Safety-Gym benchmarks show that ESB-CPO outperforms the baseline algorithms, obtaining higher rewards while still satisfying the safety constraints. These results support the effectiveness of ESB-CPO in improving exploration efficiency while meeting the constraints.
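The idea of a constraint that starts loose and tightens over training can be illustrated with a minimal sketch. This is not the paper's update rule: ESB-CPO derives its extra budget from LAE and adapts \(\alpha\) and \(\beta\), whereas the code below substitutes a simple linear decay as a stand-in. The function names, the schedule, and the numeric values are all hypothetical and chosen only for illustration.

```python
def extra_safety_budget(step: int, total_steps: int, initial_extra: float) -> float:
    """Hypothetical linear decay of the extra budget from initial_extra to 0.

    Assumption: a linear schedule stands in for the paper's LAE-based,
    adaptively tuned budget, which is not reproduced here.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1] so the budget never goes negative.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_extra * (1.0 - progress)


def effective_cost_limit(step: int, total_steps: int,
                         cost_limit: float, initial_extra: float) -> float:
    """Cost limit used at a given step: the original limit plus the remaining extra budget."""
    return cost_limit + extra_safety_budget(step, total_steps, initial_extra)


if __name__ == "__main__":
    # Illustrative numbers only: original cost limit 25, extra budget 10.
    for step in (0, 250, 500, 750, 1000):
        limit = effective_cost_limit(step, 1000, cost_limit=25.0, initial_extra=10.0)
        print(f"step {step:4d}: effective cost limit = {limit}")
```

Under this assumed schedule the policy is allowed more cost early on, and the effective limit coincides with the original cost limit by the end of training, which mirrors the qualitative behaviour described above.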