Abstract: After a large "teacher" neural network has been trained on labeled data, the probabilities that the teacher assigns to incorrect classes reveal a lot of information about the way in which the teacher generalizes. By training a small "student" model to match these probabilities, it is possible to transfer most of the generalization ability of the teacher to the student, often producing a much better small model than directly training the student on the training data. The transfer works best when there are many possible classes because more is then revealed about the function learned by the teacher, but in cases where there are only a few possible classes we show that we can improve the transfer by forcing the teacher to divide each class into many subclasses that it invents during the supervised training. The student is then trained to match the subclass probabilities. For datasets where there are known, natural subclasses we demonstrate that the teacher learns similar subclasses and these improve distillation. For clickthrough datasets where the subclasses are unknown we demonstrate that subclass distillation allows the student to learn faster and better.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to improve the efficiency of knowledge distillation when the number of classes is small. Specifically, the paper proposes a new method: subclass distillation. By forcing the teacher model to further divide each class into multiple subclasses during training, the amount of information transmitted from the teacher model to the student model is increased. This method is especially suitable for binary classification or scenarios with a small number of classes, because in these cases the limited number of classes also limits the amount of information that traditional distillation methods can transmit to the student model.

### Background of the Paper and Problem Description

1. **Limitations of Traditional Distillation Methods**:
   - Traditional knowledge distillation methods usually achieve knowledge transfer by matching the output probabilities of the teacher model and the student model at the final classification layer.
   - For datasets with a large number of classes, this method works well because the probabilities the teacher model assigns to wrong classes provide rich information.
   - However, for cases with a small number of classes (such as binary classification tasks), the amount of information provided by the teacher model is limited, resulting in poor distillation results.
2. **Proposal of Subclass Distillation**:
   - To overcome the above limitations, the paper proposes the subclass distillation method.
   - Subclass distillation increases the amount of information by forcing the teacher model to divide each class into multiple subclasses during training.
   - During training, the student model needs to match not only the class probabilities of the teacher model but also the probabilities of these subclasses.

### Method Overview

1. **Training of the Teacher Model**:
   - When training the teacher model, in addition to the regular classification loss, an auxiliary loss is introduced to encourage each subclass to be used evenly and each prediction to be "sharp" (that is, the probability is concentrated on a single subclass).
   - The specific loss functions are as follows:

   \[ L_{\text{xent}} = -\frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{c} Y_{i,j} \log \left( \sum_{k = 1}^{s} P_{i,j,k} \right) \]

   \[ L_{\text{aux}} = -\frac{1}{n} \sum_{i = 1}^{n} \log \left( \frac{e^{\hat{v}_i^T \hat{v}_i / T}}{\sum_{j = 1}^{n} e^{\hat{v}_i^T \hat{v}_j / T}} \right) \]

   \[ L_{\text{teacher}} = L_{\text{xent}} + \beta L_{\text{aux}} \]

2. **Training of the Student Model**:
   - When training the student model, it is necessary to minimize not only the difference in class probabilities from the teacher model but also the difference in subclass probabilities.
   - The specific loss functions are as follows:

   \[ L_{\text{distill}} = -\frac{T^2}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{c} \sum_{k = 1}^{s} P_{i,j,k} \log (\tilde{P}_{i,j,k}) \]

   \[ L_{\text{student}} = \alpha L_{\text{distill}} + (1 - \alpha) L_{\text{xent}} \]

### Experimental Results

1. **CIFAR-10 Experiment**:
   - The CIFAR-10 dataset is converted into a binary classification task (CIFAR-2x5). Through subclass distillation, the performance of the student model is significantly improved.
   - Subclass distillation not only improves accuracy but also speeds up the training process.
2. **Cel**
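
The loss functions in the Method Overview can be sketched in plain Python to make the index structure concrete. This is a minimal illustration and not the authors' implementation. It makes several assumptions that the summary does not state: subclass probabilities come from a single softmax over all `c * s` logits, the distillation term uses temperature-softened probabilities for both teacher and student, labels are integer class indices rather than one-hot rows, and the vectors `v_hat` in the auxiliary loss are supplied by the caller because the summary does not define them. All function and parameter names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a flat list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def subclass_probs(logits, c, s, temperature=1.0):
    """Assumption: one softmax over all c*s logits, viewed as P[j][k]."""
    flat = softmax(logits, temperature)
    return [flat[j * s:(j + 1) * s] for j in range(c)]


def class_probs(p):
    """Class probability is the sum of that class's subclass probabilities."""
    return [sum(row) for row in p]


def xent_loss(batch_probs, labels):
    """L_xent with integer labels: -mean log(sum_k P[i][y_i][k])."""
    n = len(batch_probs)
    return -sum(math.log(sum(p[y])) for p, y in zip(batch_probs, labels)) / n


def aux_loss(v_hat, temperature):
    """L_aux as written in the summary; v_hat is a list of n vectors."""
    n = len(v_hat)
    total = 0.0
    for i in range(n):
        sims = [sum(a * b for a, b in zip(v_hat[i], v_hat[j])) / temperature
                for j in range(n)]
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in sims))
        total += sims[i] - log_denominator
    return -total / n


def distill_loss(teacher_probs, student_probs, temperature):
    """L_distill: -(T^2 / n) * sum_{i,j,k} P[i][j][k] * log(P_student[i][j][k])."""
    n = len(teacher_probs)
    total = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        for row_t, row_s in zip(p_teacher, p_student):
            for t, q in zip(row_t, row_s):
                total += t * math.log(q)
    return -(temperature ** 2) * total / n


def student_loss(teacher_logits, student_logits, labels, c, s, temperature, alpha):
    """L_student = alpha * L_distill + (1 - alpha) * L_xent."""
    teacher_soft = [subclass_probs(z, c, s, temperature) for z in teacher_logits]
    student_soft = [subclass_probs(z, c, s, temperature) for z in student_logits]
    student_hard = [subclass_probs(z, c, s) for z in student_logits]
    return (alpha * distill_loss(teacher_soft, student_soft, temperature)
            + (1 - alpha) * xent_loss(student_hard, labels))


# Toy usage: a binary task (c=2) with s=2 subclasses per class, batch of 2.
teacher_logits = [[2.0, 0.5, -1.0, -1.5], [-1.0, -0.5, 0.2, 3.0]]
student_logits = [[1.0, 0.0, -0.5, -1.0], [-0.5, 0.0, 0.5, 2.0]]
loss = student_loss(teacher_logits, student_logits, labels=[0, 1],
                    c=2, s=2, temperature=4.0, alpha=0.5)
print(loss)
```

In this sketch the teacher's four-way subclass distribution carries more information per example than its two-way class distribution, which is the motivation the summary gives for distilling on subclasses when the number of classes is small.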

Subclass Distillation

DCCD: Reducing Neural Network Redundancy Via Distillation

Tree-like Decision Distillation

Contrastive Representation Distillation

Distilling Knowledge via Intermediate Classifiers

Teaching What You Should Teach: A Data-Based Distillation Method

Towards Understanding Knowledge Distillation

Debiased Distillation by Transplanting the Last Layer

Exploring the Knowledge Transferred by Response-Based Teacher-Student Distillation

UNIC: Universal Classification Models via Multi-teacher Distillation

DCD: Discriminative and Consistent Representation Distillation

Adversarial Distillation for Learning with Privileged Provisions

Knowledge Distillation with a Precise Teacher and Prediction with Abstention

Customizing a Teacher for Feature Distillation

What Knowledge Gets Distilled in Knowledge Distillation?

Knowledge Distillation Transfer Sets and their Impact on Downstream NLU Tasks

Multi-Teacher Knowledge Distillation for Incremental Implicitly-Refined Classification

Subclass Knowledge Distillation with Known Subclass Labels

What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias

Linear Projections of Teacher Embeddings for Few-Class Distillation

Prune Your Model Before Distill It