Understanding Self-Distillation and Partial Label Learning in Multi-Class Classification with Label Noise

Hyeonsu Jeong, Hye Won Chung
2024-02-16
Abstract: Self-distillation (SD) is the process of training a student model using the outputs of a teacher model, with both models sharing the same architecture. Our study theoretically examines SD in multi-class classification with cross-entropy loss, exploring both multi-round SD and SD with refined teacher outputs, inspired by partial label learning (PLL). By deriving a closed-form solution for the student model's outputs, we discover that SD essentially functions as label averaging among instances with high feature correlations. Initially beneficial, this averaging helps the model focus on feature clusters correlated with a given instance for predicting the label. However, it leads to diminishing performance with increasing distillation rounds. Additionally, we demonstrate SD's effectiveness in label noise scenarios and identify the label corruption condition and minimum number of distillation rounds needed to achieve 100% classification accuracy. Our study also reveals that one-step distillation with refined teacher outputs surpasses the efficacy of multi-step SD using the teacher's direct output in high noise rate regimes.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how self-distillation (SD) can improve model performance under label noise in multi-class classification. Through theoretical analysis and experiments, the authors establish the following points:

1. **Effects of self-distillation**:
   - The paper studies self-distillation in multi-class classification with the cross-entropy loss. The authors derive a closed-form solution for the student model's outputs and find that self-distillation can be interpreted as label averaging among instances with high feature correlation.
   - Initially, this label averaging helps the model focus on the feature clusters correlated with a given instance, which improves prediction accuracy. As the number of distillation rounds increases, however, performance gradually declines because the model over-relies on label averaging and ignores other important feature information.
2. **Impact of label noise**:
   - The authors further examine the effectiveness of self-distillation under label noise. They identify the label corruption condition and the minimum number of distillation rounds required to achieve 100% classification accuracy.
   - The experiments show that, at high label noise rates, self-distillation refined with partial label learning (PLL) outperforms standard self-distillation.
3. **Combination with partial label learning**:
   - The paper also proposes combining self-distillation with partial label learning: the top two most likely labels in the teacher model's output are selected as candidate labels for training the student. This refinement shows a significant advantage at high label noise rates.
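
The top-two candidate selection described in the last point can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `top_two_candidates` and `refine_teacher_outputs` are hypothetical, the example scores are made up, and the use of unnormalized 0/1 entries is an assumption, since the summary does not say whether the two-hot vector is normalized.

```python
def top_two_candidates(teacher_probs):
    """Return a two-hot candidate-label vector for one instance.

    teacher_probs: list of K class scores from the teacher (K >= 2).
    The two highest-scoring classes get 1, all others get 0
    (unnormalized entries are an assumption of this sketch).
    """
    if len(teacher_probs) < 2:
        raise ValueError("need at least two classes")
    # Class indices sorted by descending score; sorted() is stable,
    # so ties are broken in favor of the lower class index.
    order = sorted(range(len(teacher_probs)), key=lambda k: -teacher_probs[k])
    keep = set(order[:2])
    return [1 if k in keep else 0 for k in range(len(teacher_probs))]


def refine_teacher_outputs(teacher_outputs):
    """Apply top-two selection to every instance (one row per instance)."""
    return [top_two_candidates(row) for row in teacher_outputs]


# Made-up teacher outputs for two instances and three classes.
teacher_outputs = [
    [0.50, 0.30, 0.20],  # top two classes: 0 and 1
    [0.10, 0.45, 0.45],  # top two classes: 1 and 2
]
refined = refine_teacher_outputs(teacher_outputs)
print(refined)  # [[1, 1, 0], [0, 1, 1]]
```

The student would then be trained on these candidate sets instead of the teacher's raw outputs, so a true label that is only the teacher's second choice is still retained as a candidate.
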
### Main contributions

- **Theoretical analysis**: The paper provides a theoretical analysis of self-distillation in multi-class classification and characterizes its behavior through a closed-form solution for the student's outputs.
- **Label noise handling**: The paper identifies the conditions under which self-distillation is effective in the presence of label noise and verifies these conditions experimentally.
- **Combination with partial label learning**: The paper introduces a refinement based on partial label learning that further improves self-distillation at high label noise rates.

### Key formulas

- **Closed-form solution of self-distillation**:
  \[ Y^{(t)}=\left(Y^{(0)}-\frac{1}{K} \mathbf{1}_{K \times Kn}\right)(I_{Kn}-KB)^{t}+\frac{1}{K} \mathbf{1}_{K \times Kn} \]
  where \(Y^{(0)}\) is the initial label matrix, \(t\) is the number of distillation rounds, \(K\) is the number of classes, \(\mathbf{1}_{K \times Kn}\) is the \(K \times Kn\) all-ones matrix, and \(B\) is a matrix based on feature correlation and true labels.
- **Output of partial label learning**:
  \[ Y^{(P)}=\left(\bar{Y}-\frac{1}{K} \mathbf{1}_{K \times Kn}\right)(I_{Kn}-KB)+\frac{1}{K} \mathbf{1}_{K \times Kn} \]
  where the columns of \(\bar{Y}\) are two-hot vectors formed from the top two labels selected from the teacher model's output.

### Conclusion

Through theoretical analysis and experiments, the paper characterizes the effect of self-distillation in multi-class classification, particularly in the presence of label noise. The results show that combining self-distillation with partial label learning can significantly improve performance at high label noise rates.
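
The closed-form expression for \(Y^{(t)}\) under Key formulas can be explored numerically with a small sketch. It reads the constant term as the \(K \times Kn\) matrix with every entry equal to \(1/K\). The summary does not define \(B\), so the matrix used here is purely hypothetical: it is chosen so that \(I_{Kn} - KB = (1-\alpha) I + \alpha P\), where \(P\) averages over instances in the same feature cluster. The values of `alpha`, the cluster sizes, and the corrupted label are all made up for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def sd_closed_form(Y0, B, t):
    """Evaluate Y^(t) = (Y0 - U) (I - K B)^t + U, with U the K x Kn matrix of 1/K."""
    K, m = len(Y0), len(Y0[0])
    M = [[(1.0 if i == j else 0.0) - K * B[i][j] for j in range(m)] for i in range(m)]
    # Build (I - K B)^t by repeated multiplication, starting from the identity.
    Mt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(t):
        Mt = matmul(Mt, M)
    centered = [[y - 1.0 / K for y in row] for row in Y0]
    return [[v + 1.0 / K for v in row] for row in matmul(centered, Mt)]


def predictions(Y):
    """Predicted class per instance: argmax over each column of Y."""
    return [max(range(len(Y)), key=lambda k: Y[k][j]) for j in range(len(Y[0]))]


# Toy setup (all values hypothetical): K = 2 classes, n = 3 instances per class.
K, n, alpha = 2, 3, 0.4
m = K * n
# P averages over instances in the same cluster (block-diagonal entries 1/n).
P = [[1.0 / n if i // n == j // n else 0.0 for j in range(m)] for i in range(m)]
# Hypothetical B, chosen so that I - K B = (1 - alpha) I + alpha P.
B = [[alpha / K * ((1.0 if i == j else 0.0) - P[i][j]) for j in range(m)]
     for i in range(m)]

# Given labels as one-hot columns; instance 2 (true class 0) is corrupted to class 1.
given = [0, 0, 1, 1, 1, 1]
Y0 = [[1.0 if given[j] == k else 0.0 for j in range(m)] for k in range(K)]

for t in range(5):
    print(t, predictions(sd_closed_form(Y0, B, t)))
```

With this toy \(B\), each round mixes an instance's label with the average label of its cluster, so the corrupted instance is still predicted as class 1 for \(t \le 2\) and switches to the correct class 0 from \(t = 3\) onward. The toy is only meant to make the label-averaging reading concrete; it does not reproduce the performance decline over many rounds that the paper reports.
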