Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis

Rohan Mitta, Hosein Hasanbeig, Jun Wang, Daniel Kroening, Yiannis Kantaros, Alessandro Abate
2023-12-19
Abstract: This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that the safety constraint violations are bounded at any point during learning. In a variety of RL applications, the safety of the agent is particularly important, e.g., autonomous platforms or robots that work in proximity of humans. As enforcing safety during training might severely limit the agent's exploration, we propose here a new architecture that handles the trade-off between efficient progress and safety during exploration. As the exploration progresses, we update via Bayesian inference Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the environment dynamics. This paper proposes a way to approximate moments of belief about the risk associated with the action selection policy. We construct those approximations and prove convergence results. We propose a novel method for leveraging the expectation approximations to derive an approximate bound on the confidence that the risk is below a certain level. This approach can be easily interleaved with RL, and we present experimental results to showcase the performance of the overall architecture.
Machine Learning, Logic in Computer Science, Systems and Control
What problem does this paper attempt to address?
The paper primarily addresses the issue of ensuring the safety of agents during the reinforcement learning (RL) process, particularly in avoiding violations of safety constraints during training. The paper proposes a new framework aimed at balancing efficient learning and safe exploration. The main contributions of the paper include:
1. **Proposing a form of cautious reinforcement learning**: It assumes that the agent has limited observability of the state and constructs a probabilistic model of the environment dynamics (i.e., transition probabilities in a Markov Decision Process, MDP) through Bayesian inference. Specifically, the paper uses a Dirichlet-Categorical model to represent these transition probabilities and considers higher-order information (such as variance) to better understand the impact of uncertainty on risk levels.
2. **Risk assessment and reduction**: A risk metric ρm(s, a) is defined, which measures the probability of entering an unsafe state within the next m steps after taking action a from the current state s. To assess this risk, the paper introduces a random variable ϱm(s, a) to represent the agent's belief about the risk ρm(s, a) and provides approximate methods for the expectation and variance of this belief. Additionally, convergence results for these approximations are presented.
3. **Confidence bound estimation**: The Cantelli inequality is used to estimate the confidence level that the agent's risk is below a certain threshold.
4. **RCRL algorithm**: A method called "Risk-aware Cautious Reinforcement Learning" (RCRL) is proposed. This method includes two learners: an optimistic learner focused on maximizing cumulative rewards, and a pessimistic learner that maintains the Dirichlet-Categorical model of the MDP and is used to compute the expected risk and variance of each action at every step. By combining the information from these two learners, the agent can explore the environment more safely.
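To make the pieces above concrete, the following is a minimal sketch of how a Dirichlet-Categorical model, an m-step risk, and a Cantelli-style confidence bound could fit together. It is not the paper's method: the paper derives analytic approximations of the expectation and variance of the risk belief, whereas this sketch estimates those two moments by Monte Carlo sampling from the Dirichlet posterior. All function names, the nested-list layout `counts[s][a][s']`, the fixed follow-up `policy`, and the parameter `risk_threshold` are assumptions made for illustration only.

```python
import random
from statistics import fmean, pvariance


def update_counts(counts, s, a, s_next):
    """Dirichlet-Categorical update: add one pseudo-count for the observed transition."""
    counts[s][a][s_next] += 1.0


def sample_model(counts, rng):
    """Draw one transition model P[s][a][s'] from the Dirichlet posterior."""
    model = []
    for per_state in counts:
        rows = []
        for alpha in per_state:
            # A Dirichlet sample is a vector of Gamma(alpha_i, 1) draws, normalised.
            draws = [rng.gammavariate(c, 1.0) for c in alpha]
            total = sum(draws)
            rows.append([d / total for d in draws])
        model.append(rows)
    return model


def m_step_risk(P, unsafe, policy, s, a, m):
    """Probability of entering an unsafe state within m steps when taking
    action a in state s and following `policy` (an assumption) afterwards."""
    n = len(P)
    # risk[x]: probability that x is unsafe or an unsafe state is reached
    # from x within k further steps; this initial vector is the k = 0 case.
    risk = [1.0 if unsafe[x] else 0.0 for x in range(n)]
    for _ in range(m - 1):
        risk = [
            1.0 if unsafe[x]
            else sum(p * r for p, r in zip(P[x][policy[x]], risk))
            for x in range(n)
        ]
    return sum(p * r for p, r in zip(P[s][a], risk))


def risk_belief_moments(counts, unsafe, policy, s, a, m, n_samples=500, seed=0):
    """Monte Carlo estimate of the mean and variance of the risk belief
    (a stand-in for the paper's analytic approximations)."""
    rng = random.Random(seed)
    samples = [
        m_step_risk(sample_model(counts, rng), unsafe, policy, s, a, m)
        for _ in range(n_samples)
    ]
    return fmean(samples), pvariance(samples)


def cantelli_confidence(mean, var, risk_threshold):
    """Cantelli lower bound on P(risk < risk_threshold) given mean and variance."""
    gap = risk_threshold - mean
    if gap <= 0.0:
        return 0.0  # the bound is uninformative when the mean is at or above the threshold
    return 1.0 - var / (var + gap * gap)


if __name__ == "__main__":
    # Toy 3-state chain (state 2 is unsafe), one action, one pseudo-count as prior.
    counts = [[[1.0, 1.0, 1.0]] for _ in range(3)]
    for _ in range(20):
        update_counts(counts, 0, 0, 1)  # twenty observed safe transitions 0 -> 1
    mean, var = risk_belief_moments(
        counts, [False, False, True], [0, 0, 0], s=0, a=0, m=2
    )
    print(mean, var, cantelli_confidence(mean, var, risk_threshold=0.5))
```

In a sketch like this, an agent could reject an action whenever `cantelli_confidence` falls below a chosen confidence level. Because the bound uses the variance as well as the mean, an action whose risk is still poorly known is treated more cautiously than one with the same expected risk and more supporting observations.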
The experimental section evaluates the RCRL algorithm in the "Slippery Bridge Crossing" scenario, where the agent needs to reach a target area while avoiding unsafe states. By varying parameters such as the prior information and the maximum allowable risk Φmax, the experiments illustrate the effectiveness and flexibility of the algorithm. In summary, this paper proposes an approach to safe exploration in reinforcement learning, aimed particularly at applications where strict risk control is required.