Abstract: This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that safety constraint violations are bounded at any point during learning. In a variety of RL applications the safety of the agent is particularly important, e.g. autonomous platforms or robots that work in proximity to humans. As enforcing safety during training might severely limit the agent's exploration, we propose a new architecture that handles the trade-off between efficient progress and safety during exploration. As exploration progresses, we use Bayesian inference to update Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the environment dynamics. We propose a way to approximate the moments of the belief about the risk associated with the action selection policy, construct those approximations, and prove convergence results. We also propose a novel method that leverages the expectation approximations to derive an approximate bound on the confidence that the risk is below a certain level. This approach can be easily interleaved with RL, and we present experimental results to showcase the performance of the overall architecture.

What problem does this paper attempt to address?

The paper primarily addresses the issue of ensuring the safety of agents during the reinforcement learning (RL) process, particularly in avoiding violations of safety constraints during training. The paper proposes a new framework aimed at balancing efficient learning and safe exploration. The main contributions of the paper include:

1. **Proposing a form of cautious reinforcement learning**: It assumes that the agent has limited observability of the state and constructs a probabilistic model of the environment dynamics (i.e., transition probabilities in a Markov Decision Process, MDP) through Bayesian inference. Specifically, the paper uses a Dirichlet-Categorical model to represent these transition probabilities and considers higher-order information (such as variance) to better understand the impact of uncertainty on risk levels.
2. **Risk assessment and reduction**: A risk metric ρm(s, a) is defined, which measures the probability of entering an unsafe state within the next m steps after taking action a from the current state s. To assess this risk, the paper introduces a random variable ϱm(s, a) to represent the agent's belief about the risk ρm(s, a) and provides approximate methods for the expectation and variance of this belief. Additionally, convergence results for these approximations are presented.
3. **Confidence bound estimation**: The Cantelli inequality is used to estimate the confidence level that the agent's risk is below a certain threshold.
4. **RCRL algorithm**: A method called "Risk-aware Cautious Reinforcement Learning" (RCRL) is proposed. This method includes two learners: an optimistic learner focused on maximizing cumulative rewards, and a pessimistic learner that maintains the Dirichlet-Categorical model of the MDP and is used to compute the expected risk and variance of each action at every step. By combining the information from these two learners, the agent can explore the environment more safely.

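
The Dirichlet-Categorical model and the m-step risk described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the array layout `(state, action, next_state)`, and the fixed follow-up `policy` are assumptions made for illustration, and the risk is computed by plugging in the posterior mean of the transition model rather than by the paper's moment approximations.

```python
import numpy as np

def posterior_alpha(prior_alpha, counts):
    """Dirichlet-Categorical update: posterior parameters = prior + observed counts."""
    return np.asarray(prior_alpha, dtype=float) + np.asarray(counts, dtype=float)

def dirichlet_mean_var(alpha):
    """Mean and variance of each transition probability under Dirichlet(alpha).

    The last axis of `alpha` indexes the next state.
    """
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum(axis=-1, keepdims=True)
    mean = alpha / a0
    var = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1.0))
    return mean, var

def m_step_risk(p_mean, policy, unsafe, s, a, m):
    """Probability of entering an unsafe state within m steps after taking a in s.

    Assumptions (hypothetical, for illustration only):
      p_mean : array of shape (S, A, S), a point estimate of the transitions
      policy : length-S array of actions followed after the first step
      unsafe : length-S boolean mask of unsafe states
    """
    n_states = p_mean.shape[0]
    unsafe = np.asarray(unsafe, dtype=bool)
    policy = np.asarray(policy, dtype=int)
    # Transition matrix under the follow-up policy, shape (S, S).
    p_pi = p_mean[np.arange(n_states), policy]
    # reach[s'] = probability of having hit an unsafe state, starting at s',
    # within the remaining steps; unsafe states count as already hit.
    reach = unsafe.astype(float)
    for _ in range(m - 1):
        reach = np.where(unsafe, 1.0, p_pi @ reach)
    return float(p_mean[s, a] @ reach)
```

With a uniform prior of ones over three next states and a single observed transition into the third state, `dirichlet_mean_var(posterior_alpha([1, 1, 1], [0, 0, 1]))` gives a posterior mean of 0.5 for that transition, and the variance shrinks as more counts accumulate.
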
The experimental section demonstrates the performance of the RCRL algorithm in the "Slippery Bridge Crossing" scenario, where the agent needs to reach a target area while avoiding unsafe states. The experiments vary parameters such as the prior information and the maximum allowable risk Φmax to illustrate the effectiveness and flexibility of the algorithm. In summary, this paper proposes an approach to safe exploration in reinforcement learning, aimed particularly at applications where strict risk control is required.
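
To make the role of the maximum allowable risk Φmax and the Cantelli-based confidence bound more concrete, here is a minimal Python sketch. It assumes the mean and variance of the risk belief for each action are already available; the function names and the `min_conf` threshold are hypothetical and are not taken from the paper, which may combine these quantities differently.

```python
def cantelli_confidence(mean, var, phi_max):
    """Lower bound on P(risk < phi_max) from Cantelli's one-sided inequality.

    Cantelli: P(X - mean >= gap) <= var / (var + gap^2) for gap > 0,
    hence P(X < mean + gap) >= gap^2 / (var + gap^2).
    """
    gap = phi_max - mean
    if gap <= 0.0:
        # The expected risk already reaches the threshold:
        # the inequality gives no useful guarantee.
        return 0.0
    return gap * gap / (var + gap * gap)

def safe_actions(risk_mean, risk_var, phi_max, min_conf):
    """Indices of actions whose risk is below phi_max with confidence >= min_conf.

    `min_conf` is a hypothetical parameter introduced for this illustration.
    """
    return [
        a
        for a, (mu, v) in enumerate(zip(risk_mean, risk_var))
        if cantelli_confidence(mu, v, phi_max) >= min_conf
    ]
```

For example, with Φmax = 0.3 and a risk belief of mean 0.1 and variance 0.01, the gap is 0.2 and the bound is 0.04 / 0.05 = 0.8. Raising Φmax or lowering the required confidence admits more actions, which is one way the trade-off between exploration and safety could be tuned.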

Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis

Safe Sim-to-Real Robot Exploration with Constrained Bayesian Optimization

Look Before You Leap: Safe Model-Based Reinforcement Learning with Human Intervention

Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards

Safe Exploration in Reinforcement Learning: Training Backup Control Barrier Functions with Zero Training Time Safety Violations

Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate

Safe Reinforcement Learning Using Robust Control Barrier Functions

ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Safe reinforcement learning for probabilistic reachability and safety specifications: A Lyapunov-based approach

Benchmarking Safe Exploration in Deep Reinforcement Learning

Progressive Adaptive Chance-Constrained Safeguards for Reinforcement Learning

Lyapunov-based uncertainty-aware safe reinforcement learning

Probabilistic Safeguard for Reinforcement Learning Using Safety Index Guided Gaussian Process Models

Learning-based Model Predictive Control for Safe Exploration and Reinforcement Learning

Safe Exploration Using Bayesian World Models and Log-Barrier Optimization

End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks

Safe Model-Based Reinforcement Learning for Systems with Parametric Uncertainties

Probabilistic Constraint for Safety-Critical Reinforcement Learning

A Dynamic Safety Shield for Safe and Efficient Reinforcement Learning of Navigation Tasks

Probabilistic Counterexample Guidance for Safer Reinforcement Learning (Extended Version)

Iterative Reachability Estimation for Safe Reinforcement Learning