Abstract: After a large "teacher" neural network has been trained on labeled data, the probabilities that the teacher assigns to incorrect classes reveal a lot of information about the way in which the teacher generalizes. By training a small "student" model to match these probabilities, it is possible to transfer most of the generalization ability of the teacher to the student, often producing a much better small model than directly training the student on the training data. The transfer works best when there are many possible classes because more is then revealed about the function learned by the teacher, but in cases where there are only a few possible classes we show that we can improve the transfer by forcing the teacher to divide each class into many subclasses that it invents during the supervised training. The student is then trained to match the subclass probabilities. For datasets where there are known, natural subclasses we demonstrate that the teacher learns similar subclasses and these improve distillation. For clickthrough datasets where the subclasses are unknown we demonstrate that subclass distillation allows the student to learn faster and better.
What problem does this paper attempt to address?
This paper addresses how to make knowledge distillation effective when the number of classes is small. It proposes a new method, subclass distillation: the teacher model is forced to divide each class into multiple subclasses during training, which increases the amount of information passed from the teacher to the student. The method is especially suited to binary classification and other tasks with few classes, where the teacher's output distribution has only a few entries and traditional distillation therefore conveys relatively little information to the student.
### Background of the Paper and Problem Description
1. **Limitations of Traditional Distillation Methods**:
- Traditional knowledge distillation methods usually achieve knowledge transfer by matching the output probabilities of the teacher model and the student model at the final classification layer.
- For datasets with a large number of classes, this works well because the probabilities that the teacher assigns to incorrect classes carry rich information about how it generalizes.
- With only a few classes (for example, binary classification), the teacher's output carries much less information per example, so distillation yields a smaller benefit.
2. **Proposal of Subclass Distillation**:
- To overcome the above limitations, the paper proposes the subclass distillation method.
- Subclass distillation increases the amount of information by forcing the teacher model to divide each class into multiple subclasses, which it invents during supervised training.
- During training, the student model matches the teacher's subclass probabilities rather than only its class probabilities.
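
The relationship between subclass and class probabilities can be illustrated with a minimal sketch. It assumes the teacher has a single softmax over `num_classes * num_subclasses` outputs, with the subclasses of each class stored contiguously; that layout and the function names are illustrative assumptions, not details taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def subclass_and_class_probs(logits, num_classes, num_subclasses):
    """Group a softmax over c*s outputs into c classes of s subclasses each.

    Assumption (hypothetical layout): outputs j*s .. (j+1)*s - 1 belong to class j.
    Returns (subclass probabilities, class probabilities).
    """
    assert len(logits) == num_classes * num_subclasses
    probs = softmax(logits)
    sub = [probs[j * num_subclasses:(j + 1) * num_subclasses]
           for j in range(num_classes)]
    # A class probability is the sum of the probabilities of its subclasses.
    cls = [sum(group) for group in sub]
    return sub, cls

# Binary task (c = 2) with s = 3 invented subclasses per class.
sub, cls = subclass_and_class_probs([2.0, 0.5, -1.0, 0.1, -0.3, 1.2], 2, 3)
```

In this toy setting the student would see six target probabilities per example instead of two, which is the sense in which subclasses give the student more to match.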
### Method Overview
1. **Training of the Teacher Model**:
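- When training the teacher model, an auxiliary loss is added to the regular classification loss. It encourages the subclasses to be used evenly and each prediction to be "sharp" (that is, the probability is concentrated on a single subclass).
- The loss functions are as follows, where \(n\) is the number of examples in a batch, \(c\) the number of classes, \(s\) the number of subclasses per class, \(Y_{i,j}\) the one-hot label, \(P_{i,j,k}\) the teacher's probability for subclass \(k\) of class \(j\) on example \(i\), \(T\) a temperature, and \(\beta\) the weight of the auxiliary term. Note that in \(L_{\text{aux}}\) the index \(j\) runs over the examples in the batch rather than over classes: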
\[
L_{\text{xent}} = -\frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{c} Y_{i,j} \log \left( \sum_{k = 1}^{s} P_{i,j,k} \right)
\]
\[
L_{\text{aux}} = -\frac{1}{n} \sum_{i = 1}^{n} \log \left( \frac{e^{\hat{v}_i^T \hat{v}_i / T}}{\sum_{j = 1}^{n} e^{\hat{v}_i^T \hat{v}_j / T}} \right)
\]
\[
L_{\text{teacher}} = L_{\text{xent}} + \beta L_{\text{aux}}
\]
2. **Training of the Student Model**:
- When training the student model, the student minimizes the difference between its subclass probabilities and the teacher's, in combination with the regular classification loss on the labels.
- The loss functions are as follows, where \(\tilde{P}_{i,j,k}\) denotes the student's subclass probabilities, \(T\) the distillation temperature, and \(\alpha\) the weight that balances the two terms:
\[
L_{\text{distill}} = -\frac{T^2}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{c} \sum_{k = 1}^{s} P_{i,j,k} \log (\tilde{P}_{i,j,k})
\]
\[
L_{\text{student}} = \alpha L_{\text{distill}} + (1 - \alpha) L_{\text{xent}}
\]
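
The teacher and student losses above can be sketched directly in plain Python. This is a literal transcription of the formulas as written in this summary, under several assumptions: probabilities are passed as nested lists `probs[i][j][k]`, they are already softened by the temperature where required, `v_hat[i]` is whatever vector \(\hat{v}_i\) denotes (its construction is not specified here), and \(L_{\text{xent}}\) in the student loss is evaluated on the student's own probabilities. All function names are hypothetical.

```python
import math

def xent_loss(sub_probs, labels):
    """L_xent: cross-entropy on class probabilities (subclass probabilities summed).

    sub_probs[i][j][k] = P_{i,j,k}; labels[i] is the true class index, so the
    one-hot Y_{i,j} selects a single class per example.
    """
    n = len(sub_probs)
    return -sum(math.log(sum(sub_probs[i][labels[i]])) for i in range(n)) / n

def aux_loss(v_hat, temperature):
    """L_aux as written above: compare each v_hat[i] with every vector in the batch."""
    n = len(v_hat)
    total = 0.0
    for i in range(n):
        sims = [sum(a * b for a, b in zip(v_hat[i], v_hat[j])) / temperature
                for j in range(n)]
        m = max(sims)  # log-sum-exp shift for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in sims))
        total += sims[i] - log_denominator
    return -total / n

def teacher_loss(sub_probs, labels, v_hat, temperature, beta):
    """L_teacher = L_xent + beta * L_aux."""
    return xent_loss(sub_probs, labels) + beta * aux_loss(v_hat, temperature)

def distill_loss(teacher_sub, student_sub, temperature):
    """L_distill: T^2-scaled cross-entropy between teacher and student subclass probabilities."""
    n = len(teacher_sub)
    total = 0.0
    for p_i, q_i in zip(teacher_sub, student_sub):
        for p_j, q_j in zip(p_i, q_i):
            for p, q in zip(p_j, q_j):
                total += p * math.log(q)
    return -(temperature ** 2) * total / n

def student_loss(teacher_sub, student_sub, labels, temperature, alpha):
    """L_student = alpha * L_distill + (1 - alpha) * L_xent (on student probabilities)."""
    return (alpha * distill_loss(teacher_sub, student_sub, temperature)
            + (1 - alpha) * xent_loss(student_sub, labels))

# Toy batch: n = 2 examples, c = 2 classes, s = 2 subclasses (made-up numbers).
teacher_sub = [[[0.6, 0.2], [0.1, 0.1]], [[0.05, 0.15], [0.7, 0.1]]]
student_sub = [[[0.5, 0.2], [0.2, 0.1]], [[0.1, 0.1], [0.6, 0.2]]]
labels = [0, 1]
v_hat = [[1.0, -1.0, 0.5, -0.5], [-1.0, 1.0, -0.5, 0.5]]

l_teacher = teacher_loss(teacher_sub, labels, v_hat, temperature=1.0, beta=1.0)
l_student = student_loss(teacher_sub, student_sub, labels, temperature=1.0, alpha=0.5)
```

A real implementation would use an array library and automatic differentiation; the loops here are only meant to make each index in the formulas visible.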
### Experimental Results
1. **CIFAR-10 Experiment**:
- The CIFAR-10 dataset is converted into a binary classification task (CIFAR-2x5). Through subclass distillation, the performance of the student model is significantly improved.
- Subclass distillation not only improves accuracy but also speeds up the training process.
2. **Cel**