Revisiting Safe Exploration in Safe Reinforcement learning

David Eckel, Baohe Zhang, Joschka Bödecker
2024-09-02
Abstract: Safe reinforcement learning (SafeRL) extends standard reinforcement learning with the idea of safety, where safety is typically defined through the constraint of the expected cost return of a trajectory being below a set limit. However, this metric fails to distinguish how costs accrue, treating infrequent severe cost events as equal to frequent mild ones, which can lead to riskier behaviors and result in unsafe exploration. We introduce a new metric, expected maximum consecutive cost steps (EMCC), which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence. This metric is particularly effective for distinguishing between prolonged and occasional safety violations. We apply EMCC to both on- and off-policy algorithms to benchmark their safe exploration capability. Finally, we validate our metric through a set of benchmarks and propose a new lightweight benchmark task, which allows fast evaluation for algorithm design.
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper addresses how to balance exploration and safety in reinforcement learning (RL), and in particular how to evaluate safe exploration during training. Existing safe reinforcement learning (SafeRL) methods typically define safety as keeping the expected cost return of a trajectory below a set limit. That metric records how much cost accrues but not how it accrues, so it cannot distinguish between different patterns of unsafe behavior: frequent mild cost events and infrequent severe ones are treated as equal. As a result, an algorithm may behave more riskily during exploration while still appearing to satisfy the constraint. To address this, the authors introduce a new evaluation metric, expected maximum consecutive cost steps (EMCC). EMCC quantifies safe exploration during training by assessing how many unsafe steps occur consecutively, which makes it particularly suited to distinguishing prolonged safety violations from occasional ones. The paper applies EMCC to both on- and off-policy algorithms and also proposes a new lightweight benchmark, Circle2D, for quickly evaluating and visualizing the safe exploration performance of different SafeRL algorithms. Together, these contributions aim to give a finer-grained view of how SafeRL algorithms explore during training and thereby help in designing more effective safe exploration strategies.
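The idea behind the metric can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary above does not give the paper's exact definition, so the sketch assumes that a step counts as unsafe whenever its cost is greater than zero, and that the expectation is estimated by averaging the longest unsafe streak over a batch of recorded episodes. The function names `max_consecutive_cost_steps` and `estimate_emcc` are hypothetical.

```python
from typing import Sequence


def max_consecutive_cost_steps(costs: Sequence[float]) -> int:
    """Length of the longest run of consecutive steps with positive cost.

    Assumption: a step is "unsafe" when its cost is > 0.
    """
    longest = 0
    current = 0
    for cost in costs:
        if cost > 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a safe step ends the current unsafe streak
    return longest


def estimate_emcc(episodes: Sequence[Sequence[float]]) -> float:
    """Monte Carlo estimate: mean of the longest unsafe streak per episode."""
    if not episodes:
        raise ValueError("need at least one episode")
    streaks = [max_consecutive_cost_steps(ep) for ep in episodes]
    return sum(streaks) / len(streaks)


# Two illustrative episodes with the same cost return (4) but different structure.
frequent_mild = [1, 0, 1, 0, 1, 0, 1, 0]  # isolated unsafe steps
prolonged = [0, 0, 1, 1, 1, 1, 0, 0]      # one sustained violation

print(sum(frequent_mild), sum(prolonged))
print(max_consecutive_cost_steps(frequent_mild))
print(max_consecutive_cost_steps(prolonged))
print(estimate_emcc([frequent_mild, prolonged]))
```

Both example episodes have the same cost return, so a constraint on expected cost return cannot tell them apart, while the longest unsafe streak separates the sustained violation from the scattered ones.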