Proximal Policy Distillation

Giacomo Spigler
2024-07-21
Abstract: We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, MuJoCo, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: `sb3-distill`.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to improve the sample efficiency and final performance of policy distillation in reinforcement learning. It introduces Proximal Policy Distillation (PPD), which combines student-driven distillation with Proximal Policy Optimization (PPO) so that the student can also learn from the rewards it collects during distillation.

### Main Problems

1. **Improving Sample Efficiency**: Existing policy distillation methods often ignore the environment rewards that the student policy collects during distillation, which can lead to low sample efficiency. PPD improves sample efficiency by combining the PPO loss with a distillation loss, so these rewards are used.
2. **Enhancing Final Performance**: With traditional policy distillation, the student's final performance may be limited by that of the teacher policy. PPD aims to let the student policy surpass the teacher through student-driven distillation and PPO optimization.
3. **Increasing Robustness**: Existing methods may perform poorly when the teacher policy is imperfect. PPD improves robustness to imperfect teachers by combining environment rewards with the stability of PPO.

### Solution

- **Combining PPO and Distillation Loss**: PPD adds a distillation loss to the PPO framework, so the student policy learns from both the teacher policy and environment feedback, which accelerates learning and improves final performance.
- **Student-Driven Distillation**: Unlike traditional teacher-driven distillation methods, PPD uses the student policy to collect trajectories, which helps reduce overfitting to teacher demonstrations and improves generalization.
- **Adjustment of Hyperparameters**: The hyperparameter λ balances the PPO loss against the distillation loss, and adjusting it can further improve the student's final performance.

### Experimental Validation

The paper runs experiments across a range of reinforcement learning environments covering discrete-action and continuous-control tasks (ATARI, MuJoCo, and Procgen). The results show that PPD outperforms traditional policy distillation methods in most cases, particularly in sample efficiency and final performance. PPD is also more robust when the teacher policy is imperfect.

### Main Contributions

1. **Proposing the PPD Method**: A new policy distillation method that combines student-driven distillation and PPO, improving sample efficiency and final performance.
2. **Extensive Experimental Evaluation**: A detailed evaluation of PPD across several reinforcement learning environments, comparing it with the student-distill and teacher-distill baselines.
3. **Robustness Analysis**: PPD is more robust to imperfect teacher policies and partially recovers the original, undamaged performance.
4. **Releasing an Open-Source Library**: The sb3-distill library implements PPD and the two baseline methods, making it easier for researchers to replicate the experiments and apply the methods.

Together, these contributions provide new methods and tools for policy distillation in reinforcement learning that improve learning efficiency and final performance.
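The idea of balancing a PPO loss against a distillation loss with λ, as summarized above, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's exact objective and not the `sb3-distill` API: the function names (`ppo_clip_loss`, `kl_divergence`, `ppd_style_loss`), the batch layout, the use of a KL divergence from teacher to student as the distillation term, and the default values of `clip_eps` and `lam` are all hypothetical choices made for this example.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Negative PPO clipped surrogate for one (state, action) sample."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return -min(ratio * advantage, clipped * advantage)


def kl_divergence(teacher_probs, student_probs, eps=1e-8):
    """KL(teacher || student) over a discrete action distribution.

    Assumption: the distillation term is a KL divergence; the summary
    above does not specify its exact form.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def ppd_style_loss(batch, lam=1.0, clip_eps=0.2):
    """Batch average of (PPO loss + lam * distillation loss).

    Each sample is a dict collected by the *student* policy
    (student-driven distillation) with hypothetical keys:
    action, student_probs, teacher_probs, logp_old, advantage.
    """
    total = 0.0
    for sample in batch:
        action = sample["action"]
        logp_new = math.log(sample["student_probs"][action])
        rl_term = ppo_clip_loss(
            logp_new, sample["logp_old"], sample["advantage"], clip_eps
        )
        distill_term = kl_divergence(
            sample["teacher_probs"], sample["student_probs"]
        )
        total += rl_term + lam * distill_term
    return total / len(batch)


# Toy batch: one state with two actions, where the teacher prefers action 0.
example_batch = [
    {
        "action": 0,
        "student_probs": [0.5, 0.5],
        "teacher_probs": [0.9, 0.1],
        "logp_old": math.log(0.5),
        "advantage": 1.0,
    }
]

loss_rl_only = ppd_style_loss(example_batch, lam=0.0)
loss_with_distill = ppd_style_loss(example_batch, lam=1.0)
```

In this sketch, `lam=0.0` reduces the objective to the plain PPO term, while larger values of `lam` weight agreement with the teacher more heavily relative to the environment reward signal.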