Long-term Safe Reinforcement Learning with Binary Feedback

Akifumi Wachi, Wataru Hashimoto, Kazumune Hashimoto
2024-01-11
Abstract: Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving numeric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assumes the existence of a known safe policy for any state. Addressing the issues mentioned above, we thus propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing long-term safety, meaning that an agent executes only safe state-action pairs throughout each episode with high probability. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only a safe action at every time step while inferring its effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL guarantees the long-term safety constraint with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward.
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
### The problems the paper attempts to solve

The paper "Long-term Safe Reinforcement Learning with Binary Feedback" addresses long-term safety in reinforcement learning (RL). Specifically, it asks how an agent in a constrained Markov decision process (CMDP) can take a safe action at every time step, and so remain safe throughout the episode with high probability, when safety feedback is binary and the stochastic state transition function is unknown.

### Main challenges

1. **Binary safety feedback**: Most existing safe RL algorithms rely on numeric safety feedback, while in practical applications safety feedback is often binary (i.e., whether an action is safe or not).
2. **Unknown stochastic state transitions**: Many real-world problems have an unknown, stochastic state transition function, which makes safety harder to ensure.
3. **Long-term safety**: It is necessary to ensure safety not only at the current time step but also at future time steps.
4. **Strict constraints**: In some critical applications (such as autonomous driving, healthcare, and robotics), even a single constraint violation can lead to catastrophic consequences, so the safety constraint must be satisfied with high probability at every time step.

### Solutions

To address these problems, the paper proposes an algorithm named Long-term Binary-feedback Safe RL (LoBiSaRL). Its main features are as follows:

1. **Generalized linear model (GLM)**: The binary safety function is modeled with a GLM, which allows the algorithm to handle binary feedback.
2. **Pessimistic estimation of future safety**: Future safety values are estimated conservatively, so that the agent takes only safe actions at each time step.
3. **Long-term safety constraints**: Theoretical analysis shows that the agent satisfies the long-term safety constraint with high probability throughout the episode.
4. **Conservative policy**: The algorithm assumes a known conservative policy that bounds the distance of each state transition, which provides a moderate safety margin in the early training stage.

### Theoretical results

The theoretical results show that LoBiSaRL satisfies the long-term safety constraint under an unknown, stochastic state transition function and binary safety feedback. Specifically, by appropriately adjusting the Maximum Divergence from Conservative Policy (MDCP) term, the agent takes a safe action at every time step and remains safe throughout the episode with high probability.

### Experimental results

The experimental results show that LoBiSaRL is safer than existing methods without significantly sacrificing performance (measured by reward). This supports the effectiveness of LoBiSaRL on long-term safety problems.

### Summary

By proposing LoBiSaRL, the paper addresses the problem of ensuring long-term safety under binary safety feedback and an unknown, stochastic state transition function. The result is relevant to applications in safety-critical fields.
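
### Illustrative sketch

The GLM modeling and pessimistic safety estimation described under Solutions can be illustrated with a small example. This is a minimal sketch, not the authors' implementation: it assumes a logistic link, a hand-picked feature vector `[1, speed]`, and a fixed confidence radius `beta` chosen for illustration rather than one derived from theory. All function names (`fit_glm`, `safety_lower_bound`, `safe_actions`) and the toy data are hypothetical. It covers only the per-step safety check and does not model the effect of an action on future safety or the MDCP term.

```python
import math


def sigmoid(z):
    """Numerically stable logistic link for the GLM."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def gram(X, lam):
    """Regularised design matrix V = lam * I + sum_i x_i x_i^T."""
    d = len(X[0])
    V = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x in X:
        for i in range(d):
            for j in range(d):
                V[i][j] += x[i] * x[j]
    return V


def fit_glm(X, y, lam=1.0, steps=25):
    """L2-regularised logistic regression fitted with Newton's method."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(steps):
        grad = [lam * t for t in theta]
        hess = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
        for x, label in zip(X, y):
            p = sigmoid(dot(theta, x))
            for i in range(d):
                grad[i] += (p - label) * x[i]
                for j in range(d):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        theta = [t - s for t, s in zip(theta, solve(hess, grad))]
    return theta


def safety_lower_bound(theta, X, x_new, beta=1.0, lam=1.0):
    """Pessimistic safety estimate: link(mean - beta * ||x_new||_{V^-1})."""
    width = math.sqrt(dot(x_new, solve(gram(X, lam), x_new)))
    return sigmoid(dot(theta, x_new) - beta * width)


def safe_actions(theta, X, candidates, threshold, beta=1.0, lam=1.0):
    """Keep only candidates whose pessimistic estimate clears the threshold."""
    return [name for name, x in candidates.items()
            if safety_lower_bound(theta, X, x, beta, lam) >= threshold]


# Toy data (assumed): features are [1, speed]; label 1 = safe, 0 = unsafe.
speeds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]
X = [[1.0, s] for s in speeds]
y = [1 if s < 0.5 else 0 for s in speeds]
theta = fit_glm(X, y)

candidates = {"slow": [1.0, 0.1], "fast": [1.0, 0.9]}
for name, x in candidates.items():
    print(name, round(safety_lower_bound(theta, X, x), 3))
```

Raising `beta` widens the confidence interval and lowers every pessimistic estimate, so fewer actions pass the threshold. That is the conservative behavior the summary describes: early in training, when little binary feedback has been collected, the agent is restricted to a small set of actions it is confident about.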