A Survey on Recent Teacher-student Learning Studies

Minghong Gao
2023-04-10
Abstract: Knowledge distillation is a method of transferring the knowledge from a complex deep neural network (DNN) to a smaller and faster DNN, while preserving its accuracy. Recent variants of knowledge distillation include teaching assistant distillation, curriculum distillation, mask distillation, and decoupling distillation, which aim to improve the performance of knowledge distillation by introducing additional components or by changing the learning process. Teaching assistant distillation involves an intermediate model called the teaching assistant, while curriculum distillation follows a curriculum similar to human education. Mask distillation focuses on transferring the attention mechanism learned by the teacher, and decoupling distillation decouples the distillation loss from the task loss. Overall, these variants of knowledge distillation have shown promising results in improving the performance of knowledge distillation.
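As a point of reference for the variants named in the abstract, the classic soft-target formulation of knowledge distillation can be sketched in a few lines. This is a minimal illustration that assumes the widely used loss combining a hard-label cross-entropy term with a temperature-scaled KL term; the function names and the `temperature` and `alpha` defaults are illustrative choices, not values taken from the survey.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T).

    `temperature` and `alpha` are illustrative defaults (assumptions).
    """
    # Task term: cross-entropy of the student against the hard label (T = 1).
    task = -math.log(softmax(student_logits)[label])
    # Distillation term: KL divergence between the softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # T^2 keeps the gradient scale of the soft term comparable across temperatures.
    return alpha * task + (1 - alpha) * temperature ** 2 * kl
```

With `alpha=0` the loss depends only on how closely the student matches the teacher's softened outputs, so it is zero when the two sets of logits are identical and grows as they diverge.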
Machine Learning
What problem does this paper attempt to address?
This paper addresses several key problems in knowledge distillation (KD), a technique for compressing and accelerating deep learning models. Specifically, it reviews the following KD variants and the improvements they offer:

1. **Teaching Assistant Distillation**:
   - An intermediate model (the teaching assistant) is introduced to improve how well the student model learns. Acting as a bridge between the teacher and the student, the teaching assistant helps the student capture the teacher's knowledge and thereby improves performance.
2. **Curriculum Distillation**:
   - The learning process follows a curriculum similar to human education, starting from simple examples and gradually increasing the difficulty. This strategy is especially suitable for tasks that require a large amount of prior knowledge and can significantly improve the performance of the student model.
3. **Mask Distillation**:
   - It focuses on transferring the attention mechanism learned by the teacher model to the student model. The teacher is trained to generate a mask indicating the importance of each input feature, and the student uses this mask to weight the input features, thereby improving performance.
4. **Decoupling Distillation**:
   - The distillation loss is decoupled from the task loss. The student model is trained on the validation set to imitate the output of the teacher model, while the task loss is trained on the training set. This lets the student adapt to the specific task while retaining the knowledge of the teacher model.

In addition, the paper discusses other improvements such as Inverse Probability Weighted Distillation (IPWD), Early Exit of KD, Symmetric Temperature Scaling, the DIST method, and Curriculum Temperature for Knowledge Distillation (CTKD).
Each of these methods targets a different challenge in knowledge distillation, such as sample imbalance, knowledge transfer hampered by an overly strong teacher model, and inefficient learning by the student model during distillation. In summary, by reviewing multiple knowledge distillation variants, the paper aims to give a comprehensive picture of how introducing new mechanisms or adjusting the learning process can improve knowledge distillation, and thereby achieve better performance in model compression and acceleration.
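The easy-to-hard ordering behind curriculum distillation can be made concrete with a small sketch. It rests on two assumptions that the summary does not specify: difficulty is approximated by how little probability the teacher assigns to the true label, and the pool of available samples grows linearly over epochs. The helper names `difficulty`, `curriculum_order`, and `available_samples` are hypothetical.

```python
import math

def difficulty(teacher_probs, label):
    """Assumed difficulty proxy: low teacher confidence on the true label means harder."""
    return 1.0 - teacher_probs[label]

def curriculum_order(teacher_probs_per_sample, labels):
    """Return sample indices sorted from easiest to hardest."""
    scores = [difficulty(p, y) for p, y in zip(teacher_probs_per_sample, labels)]
    return sorted(range(len(labels)), key=lambda i: scores[i])

def available_samples(order, epoch, total_epochs, start_fraction=0.5):
    """Linear pacing (an assumption): begin with the easiest `start_fraction`
    of the data and expose the full set by the final epoch."""
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    fraction = start_fraction + (1.0 - start_fraction) * progress
    count = min(len(order), max(1, math.ceil(fraction * len(order))))
    return order[:count]

# Toy usage: four samples, two classes, teacher probabilities per sample.
teacher_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 0, 1, 1]
order = curriculum_order(teacher_probs, labels)
for epoch in range(3):
    print(epoch, available_samples(order, epoch, total_epochs=3))
```

In a real training loop, each epoch would draw its mini-batches only from the indices returned by `available_samples`, so the student sees the examples the teacher finds easiest first.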