Abstract: The Teacher-Student Framework (TSF) is a reinforcement learning setting where a teacher agent guards the training of a student agent by intervening and providing online demonstrations. When assumed optimal, the teacher policy has perfect timing and capability to intervene in the learning process of the student agent, providing safety guarantees and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and substantial safety guarantees without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy in accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C.

What problem does this paper attempt to address?

This paper addresses the following problem in reinforcement learning (RL): how to train a student policy efficiently and safely, and have it eventually surpass the teacher policy, when the teacher policy itself performs poorly. Specifically:

1. **Limitations of Existing Methods**: The traditional Teacher-Student Framework (TSF) assumes that a well-performing teacher policy is available to guide the student policy. In many practical scenarios, however, obtaining a high-performance teacher policy is costly or even impossible. If the teacher policy performs poorly, the student policy learns inefficiently and its final performance is capped by that of the teacher.
2. **Research Objectives**: The paper proposes a new method, Teacher-Student Shared Control (TS2C), that relaxes the requirement for a high-performing teacher policy. By introducing an intervention mechanism based on trajectory value estimation, TS2C lets the student policy explore the environment effectively and achieve higher cumulative rewards while remaining safe, even when the teacher policy performs poorly.
3. **Innovative Points**:
   - **Trajectory-Based Value Evaluation**: Unlike traditional methods that intervene based on action similarity, TS2C uses trajectory value estimation to decide whether to intervene. As long as the student's expected return is high enough, the teacher does not intervene, even if the student's action differs from the teacher's.
   - **Theoretical Analysis**: The authors show through theoretical analysis that TS2C achieves efficient exploration and substantial safety guarantees regardless of the teacher policy's own performance.
   - **Experimental Verification**: Experiments on various continuous control tasks show that TS2C can be trained with teacher policies of different performance levels while keeping the training cost low, and that the student policy's cumulative reward can exceed the teacher's in held-out test environments.

In summary, the main contribution of the paper is a new reinforcement learning framework that keeps the student policy effective and safe even when the teacher policy is far from ideal, thereby broadening the applicability of the teacher-student framework.
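The contrast between action-based and value-based intervention can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the scalar state and action, the `threshold` and `tolerance` parameters, and the toy value estimate `toy_q` are all assumptions made for illustration. The only idea taken from the summary above is that the teacher takes over when the student's estimated return falls too far below the teacher's, rather than whenever the two actions differ.

```python
from typing import Callable, Tuple

# Hypothetical signature: maps (state, action) to an estimated return.
ValueFn = Callable[[float, float], float]


def action_similarity_intervene(
    student_action: float, teacher_action: float, tolerance: float
) -> bool:
    """Action-based rule: intervene whenever the student deviates from the teacher."""
    return abs(student_action - teacher_action) > tolerance


def value_based_intervene(
    value_fn: ValueFn,
    state: float,
    student_action: float,
    teacher_action: float,
    threshold: float,
) -> bool:
    """Value-based rule (assumed form): intervene only when the student's
    estimated return is lower than the teacher's by more than `threshold`."""
    gap = value_fn(state, teacher_action) - value_fn(state, student_action)
    return gap > threshold


def shared_control_action(
    value_fn: ValueFn,
    state: float,
    student_action: float,
    teacher_action: float,
    threshold: float,
) -> Tuple[float, bool]:
    """Return the action that is actually executed and whether the teacher took over."""
    intervene = value_based_intervene(
        value_fn, state, student_action, teacher_action, threshold
    )
    return (teacher_action if intervene else student_action), intervene


def toy_q(state: float, action: float) -> float:
    """Toy value estimate (illustrative only): best when the action drives the state to 0."""
    return -abs(state + action)


if __name__ == "__main__":
    # The student disagrees with the teacher but its estimated return is fine.
    print(shared_control_action(toy_q, 1.0, -1.1, -0.8, 0.5))
    # The student's action has a much lower estimated return.
    print(shared_control_action(toy_q, 1.0, 2.0, -0.8, 0.5))
```

In the first call the student's action differs from the teacher's, so an action-similarity rule with a tight tolerance would intervene, but the value-based rule lets the student act because its estimated return is no worse. In the second call the estimated return gap exceeds the threshold and the teacher's action is executed instead.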

Guarded Policy Optimization with Imperfect Online Demonstrations

Safe Sim-to-Real Robot Exploration with Constrained Bayesian Optimization

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Adaptive Teaching in Heterogeneous Agents: Balancing Surprise in Sparse Reward Scenarios

Trajectory-Oriented Policy Optimization with Sparse Rewards

Safe Driving Via Expert Guided Policy Optimization

Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization

Don't Forget Your Teacher: A Corrective Reinforcement Learning Framework

Get a Head Start: On-Demand Pedagogical Policy Selection in Intelligent Tutoring

Online Non-stochastic Control with Partial Feedback

Online Policy Distillation with Decision-Attention

Policy composition in reinforcement learning via multi-objective policy optimization

Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

TGRL: An Algorithm for Teacher Guided Reinforcement Learning

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

FOSP: Fine-tuning Offline Safe Policy through World Models

Knowledge Transfer from Simple to Complex: A Safe and Efficient Reinforcement Learning Framework for Autonomous Driving Decision-Making

Online Tuning for Offline Decentralized Multi-Agent Reinforcement Learning

Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

Diagnosis, Feedback, Adaptation: A Human-in-the-Loop Framework for Test-Time Policy Adaptation

Accelerating Self-Imitation Learning from Demonstrations via Policy Constraints and Q-Ensemble