Revisiting Safe Exploration in Safe Reinforcement learning

David Eckel, Baohe Zhang, Joschka Bödecker

2024-09-02

Abstract: Safe reinforcement learning (SafeRL) extends standard reinforcement learning with the idea of safety, where safety is typically defined through the constraint of the expected cost return of a trajectory being below a set limit. However, this metric fails to distinguish how costs accrue, treating infrequent severe cost events as equal to frequent mild ones, which can lead to riskier behaviors and result in unsafe exploration. We introduce a new metric, expected maximum consecutive cost steps (EMCC), which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence. This metric is particularly effective for distinguishing between prolonged and occasional safety violations. We apply EMCC to both on- and off-policy algorithms to benchmark their safe exploration capability. Finally, we validate our metric through a set of benchmarks and propose a new lightweight benchmark task, which allows fast evaluation for algorithm design.

Machine Learning, Artificial Intelligence, Robotics

What problem does this paper attempt to address?

This paper addresses how to balance exploration and safety in reinforcement learning (RL). It argues that existing safe reinforcement learning (SafeRL) methods evaluate safe exploration during training poorly because the usual metric, the expected cost return of a trajectory, cannot tell different kinds of unsafe behavior apart. In particular, it treats frequent mild cost events the same as infrequent severe ones, which can push an algorithm toward riskier behavior during exploration and reduce overall safety.

To address this, the authors introduce a new evaluation metric, Expected Maximum Consecutive Cost steps (EMCC). EMCC quantifies safe exploration during training by measuring how many unsafe steps occur consecutively, which makes it well suited to distinguishing prolonged safety violations from occasional ones.

The paper also proposes Circle2D, a new lightweight benchmark task set for quickly evaluating and visualizing the safe exploration performance of different SafeRL algorithms. Together, these contributions aim to give a finer-grained way to evaluate and understand how SafeRL algorithms explore during training, and so to support the design of more effective safe exploration strategies.
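To make the idea behind EMCC concrete, the following is a minimal sketch and not the authors' implementation. It assumes each episode is recorded as a list of per-step costs, that a step counts as unsafe when its cost is greater than zero, and that the expectation is estimated as a plain average over episodes. The function names `max_consecutive_cost_steps` and `estimate_emcc` are hypothetical.

```python
from typing import Sequence


def max_consecutive_cost_steps(costs: Sequence[float], threshold: float = 0.0) -> int:
    """Length of the longest run of consecutive steps whose cost exceeds `threshold`.

    Assumption: a step is "unsafe" when cost > threshold (0 by default).
    """
    longest = 0
    current = 0
    for cost in costs:
        if cost > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a safe step breaks the streak
    return longest


def estimate_emcc(episodes: Sequence[Sequence[float]]) -> float:
    """Monte Carlo estimate: average the longest unsafe streak over episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(max_consecutive_cost_steps(ep) for ep in episodes) / len(episodes)


# Two illustrative episodes with the same cost return (4) but different structure.
frequent_mild = [1, 0, 1, 0, 1, 0, 1, 0]  # isolated unsafe steps
prolonged = [0, 0, 1, 1, 1, 1, 0, 0]      # one long violation

print(sum(frequent_mild), max_consecutive_cost_steps(frequent_mild))
print(sum(prolonged), max_consecutive_cost_steps(prolonged))
```

Both episodes have the same cost return, so a constraint on cost return alone cannot separate them, while the longest-streak statistic does.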

Benchmarking Safe Exploration in Deep Reinforcement Learning

Safe Exploration in Reinforcement Learning: A Generalized Formulation and Algorithms

Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards

Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings

Safe Reinforcement Learning with Dead-Ends Avoidance and Recovery

Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis

State-Wise Safe Reinforcement Learning With Pixel Observations

Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints

Iterative Reachability Estimation for Safe Reinforcement Learning

Safe Exploration by Solving Early Terminated MDP

ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate

Look Before You Leap: Safe Model-Based Reinforcement Learning with Human Intervention

GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model

Learning to be Safe: Deep RL with a Safety Critic

Cost-aware Offline Safe Meta Reinforcement Learning with Robust In-Distribution Online Task Adaptation

Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization

Probabilistic Counterexample Guidance for Safer Reinforcement Learning (Extended Version)

Safe Reinforcement Learning in Constrained Markov Decision Processes

Constrained Cross-Entropy Method for Safe Reinforcement Learning