Guarded Policy Optimization with Imperfect Online Demonstrations

Zhenghai Xue, Zhenghao Peng, Quanyi Li, Zhihan Liu, Bolei Zhou
2023-04-24
Abstract: The Teacher-Student Framework (TSF) is a reinforcement learning setting where a teacher agent guards the training of a student agent by intervening and providing online demonstrations. When assumed to be optimal, the teacher policy has the perfect timing and capability to intervene in the learning process of the student agent, providing a safety guarantee and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and a substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy, achieving higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C.
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper addresses the following problem in Reinforcement Learning (RL): when the teacher policy performs poorly, how can the student policy still be trained efficiently and safely, and ultimately surpass the teacher? Specifically:

1. **Limitations of Existing Methods**: The traditional Teacher-Student Framework (TSF) assumes that a well-performing teacher policy is available to guide the student policy. In many practical scenarios, however, obtaining a high-performance teacher policy is costly or even impossible. If the teacher policy performs poorly, the student learns inefficiently and its final performance is limited by that of the teacher.
2. **Research Objectives**: The paper proposes a new method, Teacher-Student Shared Control (TS2C), that relaxes the requirement for a high-performing teacher policy. By introducing an intervention mechanism based on trajectory value estimation, TS2C lets the student policy explore the environment effectively and achieve higher cumulative reward while preserving safety, even when the teacher policy performs poorly.
3. **Innovative Points**:
   - **Trajectory-Based Value Evaluation**: Unlike traditional methods, which intervene based on action similarity, TS2C uses trajectory value estimation to decide whether to intervene. As long as the student's expected return is high enough, the teacher does not intervene, even if the student's action differs from the teacher's.
   - **Theoretical Analysis**: The authors show through theoretical analysis that TS2C achieves efficient exploration and a substantial safety guarantee without being affected by the performance of the teacher policy itself.
   - **Experimental Verification**: Experiments on various continuous control tasks show that TS2C can be trained with teacher policies of different performance levels while maintaining a low training cost, and that the student policy can achieve higher cumulative reward than the teacher policy in held-out test environments.

In summary, the main contribution of this paper is a reinforcement learning framework that keeps student training effective and safe even when the teacher policy is far from ideal, thereby broadening the applicability of the teacher-student framework.
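The value-based intervention idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the threshold parameter, and the specific rule (intervene when a teacher value estimate exceeds the student's average estimated return by more than a threshold) are assumptions chosen to match the summary's description, and the value functions are passed in as plain callables rather than learned networks.

```python
from typing import Callable, Sequence, Tuple

State = float   # hypothetical stand-in for an environment observation
Action = float  # hypothetical stand-in for a continuous control action


def should_intervene(teacher_value: float,
                     student_action_values: Sequence[float],
                     threshold: float) -> bool:
    """Assumed rule: intervene only if the estimated return of the student's
    actions falls short of the teacher's value estimate by more than
    `threshold`. Action similarity plays no role in the decision."""
    expected_student_value = sum(student_action_values) / len(student_action_values)
    return teacher_value - expected_student_value > threshold


def shared_control_step(state: State,
                        student_policy: Callable[[State], Action],
                        teacher_policy: Callable[[State], Action],
                        value_fn: Callable[[State], float],
                        q_fn: Callable[[State, Action], float],
                        threshold: float,
                        n_samples: int = 4) -> Tuple[Action, bool]:
    """Return the action to execute and whether the teacher took over."""
    # Sample a few student actions and score them with the value estimate.
    student_actions = [student_policy(state) for _ in range(n_samples)]
    q_values = [q_fn(state, a) for a in student_actions]
    if should_intervene(value_fn(state), q_values, threshold):
        return teacher_policy(state), True
    return student_actions[0], False


if __name__ == "__main__":
    # Toy value estimates: return is highest when the action is 0.
    action, intervened = shared_control_step(
        state=0.0,
        student_policy=lambda s: 0.5,
        teacher_policy=lambda s: 0.0,
        value_fn=lambda s: 1.0,
        q_fn=lambda s, a: 1.0 - abs(a),
        threshold=0.1,
    )
    print(action, intervened)
```

With a larger threshold the same student action would be left alone, which mirrors the point that a student action differing from the teacher's is tolerated as long as its estimated return is high enough.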