Abstract: Knowledge distillation can transfer the knowledge from the pre-trained teacher model to the student model, thus effectively accomplishing model compression. Previous studies have carefully crafted knowledge representation, targeting loss function design, and distillation location selection, but there have been few studies on the role of classifiers in distillation. Previous experiences have shown that the final classifier of the model has an essential role in making inferences, so this paper attempts to narrow the gap in performance between models by having the student model directly use the classifier of the teacher model for the final inference, which requires an additional projector to help match features of the student encoder with the teacher's classifier. However, a single projector cannot fully align the features, and integrating multiple projectors may result in better performance. Considering the balance between projector size and performance, through experiments, we obtain the size of projectors for different network combinations and propose a simple method for projector integration. In this way, the student model undergoes feature projection and then uses the classifiers of the teacher model for inference, obtaining a similar performance to the teacher model. Through extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets, we show that our approach applies to various teacher–student frameworks simply and effectively.

What problem does this paper attempt to address?

The paper aims to compress the knowledge of a large pre-trained teacher model into a small student model through knowledge distillation, while keeping the student's accuracy close to the teacher's. Specifically, it focuses on the role of the classifier in knowledge distillation, and on how the student model can benefit from using the teacher model's classifier. To this end, the authors propose a knowledge distillation method based on projector integration and classifier sharing.

### Main Problems

1. **Role of Classifiers**: Previous research on knowledge distillation has mainly focused on knowledge representation, loss function design, and the selection of distillation positions, and has paid less attention to the role of classifiers in the distillation process. This paper attempts to narrow the performance gap between the student and teacher models by having the student directly use the teacher's classifier.
2. **Feature Matching**: Since the feature dimensions of the student and teacher models differ, directly using the teacher's classifier leads to a feature mismatch. The authors therefore introduce a projector that maps student features to the input expected by the teacher's classifier.
3. **Integration of Projectors**: A single projector may not fully align the features, so the authors propose a simple projector integration method to further improve the performance of the student model.

### Solutions

1. **Classifier Sharing (TCS)**: The student model directly uses the teacher model's classifier for final inference, thereby improving performance.
2. **Projector Integration**: Integrating multiple projectors improves the accuracy of feature matching, which in turn improves the performance of the student model.

### Experimental Results

- **Datasets**: The experiments were carried out on the CIFAR-100 and Tiny-ImageNet datasets.
- **Model Combinations**: Multiple teacher-student model combinations were used, such as ResNet-32×4 and ResNet-8×4, ResNet-110 and MobileNetV2, etc.
- **Performance Improvement**: The experimental results show that the method improves the performance of student models across multiple teacher-student combinations. For example, in the "ResNet-32×4 & MobileNetV2" and "ResNet-32×4 & ResNet-8×4" combinations, performance improved by 2.97% and 2.79% respectively.

### Formulas

- **Direction Alignment Loss**:
  \[ L_{\text{DA}} = \frac{1}{2b}\sum_{i=1}^{b}\left\|\frac{\text{Proj}(s_i)}{\|\text{Proj}(s_i)\|_2}-\frac{t_i}{\|t_i\|_2}\right\|_2^2 = 1-\frac{1}{b}\sum_{i=1}^{b}\frac{\langle\text{Proj}(s_i),t_i\rangle}{\|\text{Proj}(s_i)\|_2\|t_i\|_2} \]
- **Modified Direction Alignment Loss**:
  \[ L_{\text{MDA}} = 1-\frac{1}{b}\sum_{i=1}^{b}\frac{\langle\text{Proj}_{\text{Int}}(s_i),t_i\rangle}{\|\text{Proj}_{\text{Int}}(s_i)\|_2\|t_i\|_2} \]

### Conclusion

The paper improves the performance of student models through classifier sharing and projector integration, with the gains attributed mainly to better feature matching. The experimental results show that the method is effective across multiple datasets and teacher-student combinations.
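
The formulas and the inference path summarized above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that \(s_i\) and \(t_i\) are the student and teacher feature vectors of sample \(i\) in a batch of size \(b\), that each projector is a plain linear map, that "integration" means averaging the projector outputs, and that the teacher's classifier is a single linear layer. None of these details are confirmed by the text, and all function names are hypothetical. The sketch uses only the Python standard library and computes the loss in both forms given above, so that their equality can be checked numerically.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def l2_normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def integrated_projection(projectors, s):
    """Assumed integration rule: average the outputs of several linear projectors."""
    outs = [matvec(W, s) for W in projectors]
    return [sum(col) / len(outs) for col in zip(*outs)]


def mda_loss_cosine(projectors, students, teachers):
    """1 minus the mean cosine similarity between Proj_Int(s_i) and t_i."""
    total = 0.0
    for s, t in zip(students, teachers):
        p = l2_normalize(integrated_projection(projectors, s))
        q = l2_normalize(t)
        total += sum(a * b for a, b in zip(p, q))
    return 1.0 - total / len(students)


def mda_loss_distance(projectors, students, teachers):
    """Equivalent form: 1/(2b) times the summed squared distance of unit vectors."""
    total = 0.0
    for s, t in zip(students, teachers):
        p = l2_normalize(integrated_projection(projectors, s))
        q = l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / (2 * len(students))


def predict_with_teacher_classifier(projectors, cls_W, cls_b, s):
    """Project a student feature, then apply the teacher's linear classifier (assumed)."""
    feat = integrated_projection(projectors, s)
    logits = [z + b for z, b in zip(matvec(cls_W, feat), cls_b)]
    return max(range(len(logits)), key=logits.__getitem__)


def rand_matrix(rng, rows, cols):
    """Random Gaussian matrix, used here only as toy data."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (illustrative only): student dim 4, teacher dim 6, 3 classes, batch of 5.
rng = random.Random(0)
d_s, d_t, n_cls, batch = 4, 6, 3, 5
projectors = [rand_matrix(rng, d_t, d_s) for _ in range(3)]
students = rand_matrix(rng, batch, d_s)
teachers = rand_matrix(rng, batch, d_t)
cls_W = rand_matrix(rng, n_cls, d_t)
cls_b = [0.0] * n_cls

loss_cos = mda_loss_cosine(projectors, students, teachers)
loss_dist = mda_loss_distance(projectors, students, teachers)
pred = predict_with_teacher_classifier(projectors, cls_W, cls_b, students[0])
print(loss_cos, loss_dist, pred)
```

The two loss values agree up to floating-point error because, for unit vectors \(u\) and \(v\), \(\|u - v\|_2^2 = 2 - 2\langle u, v\rangle\). With a single projector in the list, the same functions compute \(L_{\text{DA}}\) instead of \(L_{\text{MDA}}\).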

Knowledge distillation based on projector integration and classifier sharing

Knowledge Distillation with the Reused Teacher Classifier

Using Less but Important Information for Feature Distillation

Research on Knowledge Distillation Algorithm of Object Detection

Improved Feature Distillation via Projector Ensemble

Understanding the Role of the Projector in Knowledge Distillation

Spherical Knowledge Distillation.

Fixing the Teacher-Student Knowledge Discrepancy in Distillation

Simplified Knowledge Distillation for Deep Neural Networks Bridging the Performance Gap with a Novel Teacher–Student Architecture

Reducing the Teacher-Student Gap via Spherical Knowledge Disitllation

Distilling Knowledge via Intermediate Classifiers

ResKD: Residual-Guided Knowledge Distillation

Densely Guided Knowledge Distillation using Multiple Teacher Assistants

Collaborative Knowledge Distillation

Improved Knowledge Distillation via Teacher Assistant

Highlight Every Step: Knowledge Distillation via Collaborative Teaching

Distilling Image Classifiers in Object Detectors

Multistage feature fusion knowledge distillation

Knowledge distillation based on multi-layer fusion features

What Knowledge Gets Distilled in Knowledge Distillation?

Knowledge Distillation with a Precise Teacher and Prediction with Abstention