Reinforcement Learning by Guided Safe Exploration

Qisong Yang, Thiago D. Simão, Nils Jansen, Simon H. Tindemans, Matthijs T. J. Spaan
DOI: https://doi.org/10.3233/FAIA230598
2023-07-27
Abstract: Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward-free RL trains an agent without the reward to adapt quickly once the reward is revealed. We consider the constrained reward-free setting, where an agent (the guide) learns to explore safely without the reward signal. This agent is trained in a controlled environment, which allows unsafe interactions and still provides the safety signal. After the target task is revealed, safety violations are not allowed anymore. Thus, the guide is leveraged to compose a safe behaviour policy. Drawing from transfer learning, we also regularize a target policy (the student) towards the guide while the student is unreliable and gradually eliminate the influence of the guide as training progresses. The empirical analysis shows that this method can achieve safe transfer learning and helps the student solve the target task faster.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is how to ensure that agents explore safely in reinforcement learning (RL) when the real-world target task is unknown, while still adapting quickly once that task is revealed. Specifically, the authors focus on training RL agents in a controlled environment before deployment to the real world, so that the agents learn safe behavior policies without reward signals and avoid violating safety constraints after the target task is revealed.

### Core Problems of the Paper

1. **Reward-free Reinforcement Learning under Safety Constraints**: In a controlled environment, agents learn from safety signals alone, without relying on reward signals. This lets agents learn how to explore the environment without violating safety constraints.
2. **Transfer Learning from Controlled Environment to Real World**: How to transfer the safe exploration strategies learned in the controlled environment to the real world, so that agents still behave safely when facing new tasks.
3. **Guided Learning of the Student Policy**: Introducing a "guide" that helps the student policy learn faster and adapt to new tasks while keeping its behavior within the safety constraints.

### Specific Challenges

- **Satisfaction of Safety Constraints**: Ensure that agents do not violate safety constraints during the learning process, especially in the real world.
- **Quick Adaptation to New Tasks**: When the target task is revealed, agents need to quickly adjust their behavior to complete the task, rather than remaining in the safe exploration stage.
- **Effectiveness of Transfer Learning**: How to effectively transfer the knowledge learned in the controlled environment to real-world tasks that differ from it partially or completely.

### Overview of Solutions

The authors propose a method named SaGui (Safe Guide), which mainly includes the following steps:

1. **Training the Safe Guiding Policy (SaGui)**: In a controlled environment, agents learn from safety signals alone, forming a policy that can explore safely in various environments.
2. **Transfer Learning**: Transfer the SaGui policy to real-world tasks, using mapping functions to transform the state space of the source task into the state space of the target task.
3. **Policy Distillation**: Use policy distillation to transfer the knowledge of the guiding policy to the student policy, so that the student learns faster and adapts to new tasks while maintaining safe behavior.
4. **Composite Sampling**: Use composite sampling to dynamically adjust the balance between the guiding policy and the student policy, so that training remains both safe and sample-efficient.

Through these steps, the authors aim to achieve safe and efficient exploration in unknown real-world tasks and quick adaptation to new tasks, thereby supporting wider application of reinforcement learning in practice.
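
The policy distillation and composite sampling steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear decay schedule, the KL direction (student relative to guide), the discrete action space, and all function names (`guide_weight`, `composite_action`, `kl_divergence`, `regularized_loss`) are assumptions made for illustration. It only shows the general idea of a behaviour policy that mixes guide and student, and a student loss that is pulled towards the guide with a weight that fades over training.

```python
import math
import random


def guide_weight(step, decay_steps):
    """Hypothetical schedule: the guide's influence decays linearly from 1 to 0."""
    return max(0.0, 1.0 - step / decay_steps)


def composite_action(guide_probs, student_probs, weight, rng):
    """Sample an action index from the guide with probability `weight`,
    otherwise from the student (assumed form of composite sampling)."""
    probs = guide_probs if rng.random() < weight else student_probs
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def regularized_loss(task_loss, student_probs, guide_probs, weight):
    """Task loss plus a penalty that pulls the student towards the guide.
    The KL direction used here is an assumption, not taken from the paper."""
    return task_loss + weight * kl_divergence(student_probs, guide_probs)


if __name__ == "__main__":
    rng = random.Random(0)
    guide = [0.7, 0.2, 0.1]    # made-up action probabilities
    student = [0.2, 0.3, 0.5]  # made-up action probabilities
    for step in (0, 500, 1000):
        w = guide_weight(step, decay_steps=1000)
        action = composite_action(guide, student, w, rng)
        loss = regularized_loss(1.0, student, guide, w)
        print(step, w, action, round(loss, 4))
```

Early in training the weight is close to 1, so actions come mostly from the guide and the student is strongly regularized towards it; once the weight reaches 0, the student acts and learns on its own. The summary does not specify the actual schedule, so other rules for adjusting the balance would fit the same structure.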