Abstract: Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward-free RL trains an agent without the reward to adapt quickly once the reward is revealed. We consider the constrained reward-free setting, where an agent (the guide) learns to explore safely without the reward signal. This agent is trained in a controlled environment, which allows unsafe interactions and still provides the safety signal. After the target task is revealed, safety violations are not allowed anymore. Thus, the guide is leveraged to compose a safe behaviour policy. Drawing from transfer learning, we also regularize a target policy (the student) towards the guide while the student is unreliable and gradually eliminate the influence of the guide as training progresses. The empirical analysis shows that this method can achieve safe transfer learning and helps the student solve the target task faster.
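As a rough illustration of regularizing the student towards the guide and gradually removing that influence, the following minimal sketch adds a KL penalty with a decaying weight to a generic task loss. The function names, the linear decay schedule, the direction of the KL term, and the use of discrete action distributions are all assumptions made for illustration; the paper may use a different mechanism for adjusting the guide's influence.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def guide_weight(step, total_steps, initial_weight=1.0):
    """Hypothetical schedule: decay the weight linearly from initial_weight to 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_weight * (1.0 - frac)


def regularized_loss(task_loss, student_probs, guide_probs, step, total_steps):
    """Task loss plus a decaying penalty for deviating from the guide's action distribution."""
    penalty = kl_divergence(student_probs, guide_probs)
    return task_loss + guide_weight(step, total_steps) * penalty
```

Early in training the penalty keeps the student close to the guide; once `step` reaches `total_steps` the weight is zero and only the task loss remains.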

What problem does this paper attempt to address?

The problem this paper addresses is how to ensure that agents explore safely in reinforcement learning (RL) when the real-world task is unknown in advance, while still adapting quickly to new tasks. Specifically, the authors focus on training RL agents in a controlled environment before deployment to the real world, so that the agents learn safe behavior policies without reward signals and avoid violating safety constraints after the target task is revealed.

### Core Problems of the Paper

1. **Reward-free Reinforcement Learning under Safety Constraints**: In a controlled environment, agents learn from safety signals only, without relying on reward signals. This lets agents learn how to explore the environment without violating safety constraints.
2. **Transfer Learning from Controlled Environment to Real World**: How to transfer the safe exploration strategies learned in the controlled environment to the real world, so that agents still behave safely when facing new tasks.
3. **Guided Learning of the Student Policy**: Introducing a "guide" that helps the student policy learn faster and adapt to new tasks while keeping its behavior within safety standards.

### Specific Challenges

- **Satisfaction of Safety Constraints**: Ensure that agents do not violate safety constraints during the learning process, especially in the real world.
- **Quick Adaptation to New Tasks**: When the target task is revealed, agents need to quickly adjust their behavior to complete the task, rather than remaining in the safe exploration stage.
- **Effectiveness of Transfer Learning**: How to effectively transfer the knowledge learned in the controlled environment to real-world tasks that differ from it partially or completely.

### Overview of Solutions

The authors propose a method named SaGui (Safe Guide), which mainly includes the following steps:

1. **Training the Safe Guiding Policy (SaGui)**: In a controlled environment, agents learn from safety signals only, forming a policy that can explore safely in various environments.
2. **Transfer Learning**: Transfer the SaGui policy to real-world tasks, using mapping functions to transform the state space of the source task into the state space of the target task.
3. **Policy Distillation**: Through policy distillation techniques, transfer the knowledge of the guiding policy to the student policy, enabling the student policy to learn faster and adapt to new tasks while maintaining safe behavior.
4. **Composite Sampling**: Use composite sampling to dynamically adjust the balance between the guiding policy and the student policy, so that training remains both safe and sample-efficient.

Through these steps, the authors aim to achieve safe and efficient exploration in unknown real-world tasks and quick adaptation to new tasks, thereby broadening the use of reinforcement learning in practical applications.
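The composite sampling step can be pictured with a minimal sketch in which the behaviour policy picks, at each step, either the guide's or the student's action. Everything here is hypothetical: the names `guide_probability` and `composite_action`, the linear schedule for the guide's share, and the optional `to_guide_state` mapping (standing in for the state-space mapping mentioned in the transfer step) are assumptions for illustration, not the authors' implementation.

```python
import random


def guide_probability(step, decay_steps, start=1.0, end=0.0):
    """Hypothetical schedule: move linearly from `start` to `end` over `decay_steps`."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + (end - start) * frac


def composite_action(state, guide_policy, student_policy, p_guide,
                     to_guide_state=None, rng=random):
    """Sample an action from the guide with probability p_guide, else from the student.

    Returns the action together with the name of the policy that produced it.
    """
    if rng.random() < p_guide:
        # Assumed mapping from the target task's state to the guide's state space.
        guide_state = to_guide_state(state) if to_guide_state else state
        return guide_policy(guide_state), "guide"
    return student_policy(state), "student"


# Usage with toy policies: the guide always brakes, the student always accelerates.
if __name__ == "__main__":
    guide = lambda s: "brake"
    student = lambda s: "accelerate"
    for step in (0, 500, 1000):
        p = guide_probability(step, decay_steps=1000)
        print(step, p, composite_action({"speed": 3}, guide, student, p))
```

Starting with the guide in full control and handing over to the student gradually is one simple design choice; a switch driven by an estimate of the student's safety would fit the same interface.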

Reinforcement Learning by Guided Safe Exploration

Learning to be Safe: Deep RL with a Safety Critic

Safe Exploration in Wireless Security: A Safe Reinforcement Learning Algorithm With Hierarchical Structure

ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings

Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis

Safe Reinforcement Learning in a Simulated Robotic Arm

Benchmarking Safe Exploration in Deep Reinforcement Learning

Safe Reinforcement Learning with Free-form Natural Language Constraints and Pre-Trained Language Models

Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees

Safety-Guided Deep Reinforcement Learning via Online Gaussian Process Estimation

Safe Exploration in Reinforcement Learning: A Generalized Formulation and Algorithms

A Dynamic Safety Shield for Safe and Efficient Reinforcement Learning of Navigation Tasks

Look Before You Leap: Safe Model-Based Reinforcement Learning with Human Intervention

Safe Reinforcement Learning Using Robust Control Barrier Functions

Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding

State-Wise Safe Reinforcement Learning With Pixel Observations

Safety reinforcement learning control via transfer learning

Probabilistic Counterexample Guidance for Safer Reinforcement Learning (Extended Version)

Safe Reinforcement Learning with Dead-Ends Avoidance and Recovery

A Deep Safe Reinforcement Learning Approach for Mapless Navigation