Deep Collective Knowledge Distillation

Jihyeon Seo, Kyusam Oh, Chanho Min, Yongkeun Yun, Sungwoo Cho
DOI: https://doi.org/10.48550/arXiv.2304.08878
2023-04-18
Abstract: Many existing studies on knowledge distillation have focused on methods in which a student model mimics a teacher model well. Simply imitating the teacher's knowledge, however, is not sufficient for the student to surpass that of the teacher. We explore a method to harness the knowledge of other students to complement the knowledge of the teacher. We propose deep collective knowledge distillation for model compression, called DCKD, which is a method for training student models with rich information to acquire knowledge from not only their teacher model but also other student models. The knowledge collected from several student models consists of a wealth of information about the correlation between classes. Our DCKD considers how to increase the correlation knowledge of classes during training. Our novel method enables us to create better performing student models for collecting knowledge. This simple yet powerful method achieves state-of-the-art performances in many experiments. For example, for ImageNet, ResNet18 trained with DCKD achieves 72.27%, which outperforms the pretrained ResNet18 by 2.52%. For CIFAR-100, the student model of ShuffleNetV1 with DCKD achieves 6.55% higher top-1 accuracy than the pretrained ShuffleNetV1.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a limitation of knowledge distillation: training a student model only to imitate its teacher is not enough for the student to surpass the teacher. To overcome this, the paper proposes Deep Collective Knowledge Distillation (DCKD), which strengthens each student model by drawing on the collective knowledge of multiple student models. Specifically, a DCKD student learns not only from the teacher model but also from other student models, in particular from their information about inter-class correlations. This gives the student a richer representation, and the paper reports state-of-the-art performance in multiple experiments.

### Main contributions of the paper:

1. **Constructing an additional knowledge set**: A novel method is designed to construct an additional knowledge set containing more information about inter-class correlations.
2. **Modifying the collection loss**: The collection loss between each student and the knowledge set is analyzed and modified, and the loss function is optimized by reversing the direction of the Kullback-Leibler divergence.

### Method overview:

- **Knowledge distillation loss**: The student model learns by imitating the output distribution of the teacher model.
- **Collection loss**: The student model also learns from the collective knowledge of other student models, which contains rich information about inter-class correlations.
- **Total loss function**: The total loss function consists of three parts: cross-entropy loss, knowledge distillation loss, and collection loss.

### Experimental results:

- **ImageNet**: ResNet18 trained with DCKD achieved an accuracy of 72.27%, which is 2.52% higher than the pretrained ResNet18.
- **CIFAR-100**: ShuffleNetV1 trained with DCKD achieved a top-1 accuracy 6.55% higher than the pretrained ShuffleNetV1.
- **Comparison of multi-student methods**: DCKD outperforms existing multi-student methods on multiple datasets, even under the guidance of a teacher network.

### Conclusion:

By introducing the Deep Collective Knowledge Distillation (DCKD) method, the paper improves the performance of the student model, especially on large-scale datasets. DCKD uses not only the knowledge of the teacher model but also the collective knowledge among multiple student models, achieving state-of-the-art results in multiple experiments.
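
The three-part loss named in the method overview (cross-entropy, knowledge distillation, and collection loss) can be illustrated with a small sketch for a single training example. This is not the paper's implementation. Several details are assumptions made only for illustration: the temperature, the weights `alpha` and `beta`, building the collected knowledge as the mean of the other students' softened outputs, and reading "reversed" Kullback-Leibler divergence as KL(student || collected) rather than KL(collected || student). All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def total_loss(student_logits, teacher_logits, peer_logits, label,
               temperature=4.0, alpha=1.0, beta=1.0):
    """Illustrative three-part loss for one example.

    peer_logits is a list of logit vectors from the other students.
    temperature, alpha and beta are assumed hyperparameters.
    """
    student_probs = softmax(student_logits)
    student_soft = softmax(student_logits, temperature)
    teacher_soft = softmax(teacher_logits, temperature)

    # Assumption: the collected knowledge is the mean of the peers'
    # softened distributions. The paper's knowledge set may differ.
    peer_soft = [softmax(logits, temperature) for logits in peer_logits]
    collected = [sum(column) / len(peer_soft) for column in zip(*peer_soft)]

    # Cross-entropy with the ground-truth label.
    ce = -math.log(student_probs[label] + 1e-12)
    # Distillation term in the usual direction: KL(teacher || student).
    kd = kl_divergence(teacher_soft, student_soft)
    # Collection term with the direction swapped (assumed reading of
    # "reversed"): KL(student || collected).
    collection = kl_divergence(student_soft, collected)

    return {
        "ce": ce,
        "kd": kd,
        "collection": collection,
        "total": ce + alpha * kd + beta * collection,
    }


if __name__ == "__main__":
    losses = total_loss(
        student_logits=[2.0, 0.5, -1.0],
        teacher_logits=[3.0, 1.0, -2.0],
        peer_logits=[[1.0, 1.5, -0.5], [2.5, 0.0, -1.0]],
        label=0,
    )
    for name, value in losses.items():
        print(f"{name}: {value:.4f}")
```

When the student, teacher, and peers all produce the same logits, both divergence terms vanish and only the cross-entropy term remains, which is a quick sanity check on the argument order of each term.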