Learn From the Past: Experience Ensemble Knowledge Distillation

Chaofei Wang, Shaowei Zhang, Shiji Song, Gao Huang
DOI: https://doi.org/10.48550/arXiv.2202.12488
2022-02-25
Abstract: Traditional knowledge distillation transfers "dark knowledge" of a pre-trained teacher network to a student network, and ignores the knowledge in the training process of the teacher, which we call teacher's experience. However, in realistic educational scenarios, learning experience is often more important than learning results. In this work, we propose a novel knowledge distillation method by integrating the teacher's experience for knowledge transfer, named experience ensemble knowledge distillation (EEKD). We save a moderate number of intermediate models from the training process of the teacher model uniformly, and then integrate the knowledge of these intermediate models by ensemble technique. A self-attention module is used to adaptively assign weights to different intermediate models in the process of knowledge transfer. Three principles of constructing EEKD on the quality, weights and number of intermediate models are explored. A surprising conclusion is found that strong ensemble teachers do not necessarily produce strong students. The experimental results on CIFAR-100 and ImageNet show that EEKD outperforms the mainstream knowledge distillation methods and achieves the state-of-the-art. In particular, EEKD even surpasses the standard ensemble distillation on the premise of saving training cost.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is that traditional knowledge distillation transfers knowledge only from the final pre-trained teacher model and ignores the experience the teacher accumulates during training. The authors argue that, in real-life educational scenarios, learning experience is often more important than learning results. They therefore propose Experience Ensemble Knowledge Distillation (EEKD), which integrates the knowledge of intermediate models from the teacher's training process through ensemble techniques and transfers it to the student model.

Specifically, the traditional methods discussed in the paper rely on the pre-trained teacher model and use soft labels (i.e., the softmax output adjusted by a temperature factor) to guide the student's learning. This ignores the experience accumulated by the teacher during training, which the authors believe is equally important, or even more crucial, for the student. Motivated by this, EEKD saves multiple intermediate models during teacher training and uses a self-attention mechanism to assign weights to them, achieving more effective knowledge transfer.

### Main contributions of the paper:

1. **Proposing a new experience ensemble knowledge distillation method**: The method improves student performance by integrating the knowledge of intermediate models from the teacher's training process.
2. **Discovering that a strong ensemble teacher does not necessarily produce a strong student**: This finding challenges existing knowledge distillation methods and prompts a rethink of ensemble distillation strategy.
3. **Experimental results verify the superiority of EEKD**: Results on the CIFAR-100 and ImageNet datasets show that EEKD outperforms existing state-of-the-art knowledge distillation methods and, in some cases, even outperforms standard ensemble distillation while reducing the training cost.

### Key techniques and principles of the paper:

- **Experience Ensemble Knowledge Distillation (EEKD) framework**: Intermediate models are saved at uniform intervals during teacher training, their knowledge is combined using ensemble techniques, and a self-attention mechanism assigns weights to the different intermediate models during knowledge transfer.
- **Self-attention mechanism**: Used to learn the weights of the intermediate models automatically, so that the student draws more knowledge from the most suitable intermediate teachers.
- **Exploring three principles for constructing EEKD**:
  - **Quality of intermediate models**: High-quality intermediate models help improve the performance of the virtual ensemble teacher.
  - **Weights of intermediate models**: The weighting strategy learned by the self-attention mechanism is superior to mean, linearly increasing, and linearly decreasing weighting strategies.
  - **Number of intermediate models**: A trade-off between performance and cost is required. Too many intermediate models significantly increase the training cost.

### Experimental results:

- **On the CIFAR-100 dataset**: EEKD significantly improves the performance of the student model, with an average accuracy increase of 4.86%, and in some settings the student even exceeds the teacher.
- **On the ImageNet dataset**: EEKD further narrows the performance gap between the teacher and student models, with relative improvements of 54% and 41% respectively.
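The pipeline summarized above (uniformly saved intermediate teachers, temperature-softened outputs, and a weighted ensemble used as the distillation target) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the checkpoint schedule, and the use of a softmax over hypothetical attention scores as a stand-in for the self-attention module are all assumptions, and the loss is the standard temperature-scaled KL formulation of knowledge distillation rather than a formula taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_checkpoint_epochs(total_epochs, num_models):
    """Evenly spaced epochs at which to save intermediate teachers.

    Assumption: the last saved model coincides with the final epoch.
    """
    step = total_epochs / num_models
    return [round(step * (i + 1)) for i in range(num_models)]


def ensemble_soft_targets(teacher_logits, weights, temperature):
    """Weighted average of the softened outputs of the intermediate teachers."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]


def kd_loss(student_logits, soft_targets, temperature):
    """KL(soft_targets || student) scaled by T^2 (standard KD formulation)."""
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(soft_targets, student_probs)
        if p > 0
    )
    return temperature ** 2 * kl


# Hypothetical example: three intermediate teachers, four classes.
teacher_logits = [
    [1.0, 0.5, 0.2, 0.1],   # early checkpoint, still uncertain
    [2.5, 0.8, 0.1, -0.3],  # middle checkpoint
    [4.0, 0.5, -0.5, -1.0], # final checkpoint, confident
]
# Stand-in for the self-attention module: softmax over made-up scores.
attention_scores = [0.2, 0.9, 1.1]
weights = softmax(attention_scores)

targets = ensemble_soft_targets(teacher_logits, weights, temperature=4.0)
loss = kd_loss([1.5, 0.7, 0.0, -0.2], targets, temperature=4.0)
print(uniform_checkpoint_epochs(240, 4), weights, loss)
```

Replacing `weights` with a uniform, linearly increasing, or linearly decreasing vector reproduces the fixed weighting baselines that the summary compares against the learned weights.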
In conclusion, this paper proposes a new knowledge distillation method that draws on the training experience of the teacher model. It achieves a significant improvement in performance and offers a useful theoretical complement to existing knowledge distillation methods.