Boosting Knowledge Distillation Via Intra-class Logit Distribution Smoothing

Cong Li, Gong Cheng, Junwei Han
DOI: https://doi.org/10.1109/tcsvt.2023.3327113
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Prior work has established a close link between knowledge distillation (KD) and label smoothing (LS), in that both impose regularization on model training. In this paper, we investigate the hidden reason why KD and LS exert distinct effects on a model’s ability in sequential knowledge transfer. Specifically, we observe that the distilled model typically exhibits much higher intra-class variance than the regularized one, and consequently acts as the better teacher. We then devise two exploratory experiments and identify that sufficient intra-class variance retained by a teacher model is an implicit distillation recipe for achieving competitive student performance. These observations allow us to put forth a simple yet beneficial approach that promotes intra-class diversity during the optimization of the teacher models so as to obtain the most promising KD performance. Extensive experiments are conducted on various image classification tasks across three distillation paradigms, demonstrating the effectiveness and generalization of the proposed method. Additionally, we offer new interpretations that give a deeper understanding of the gap issue (i.e., a better teacher yielding a worse student) and of the success of multi-generation self-distillation. Code will be made available at https://github.com/swift1988.
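To make the notion of intra-class variance in the abstract concrete, the following is a minimal sketch of one way such a quantity could be measured on a model's output logits. It is not the authors' implementation: the function name `intra_class_variance`, the choice of mean squared Euclidean distance to the class centroid, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def intra_class_variance(logits, labels):
    """Average, over classes, of the mean squared distance between each
    sample's logit vector and the centroid of its class.
    (Hypothetical helper; the paper may define the measure differently.)"""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    per_class = []
    for c in np.unique(labels):
        class_logits = logits[labels == c]      # all samples of class c
        centroid = class_logits.mean(axis=0)    # class mean in logit space
        sq_dist = ((class_logits - centroid) ** 2).sum(axis=1)
        per_class.append(sq_dist.mean())
    return float(np.mean(per_class))

# Toy logits for two classes with two samples each (made-up numbers).
labels = np.array([0, 0, 1, 1])
tight = np.array([[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]])
spread = np.array([[5.0, 0.0], [3.0, 0.0], [0.0, 5.0], [0.0, 3.0]])

print(intra_class_variance(tight, labels))   # samples collapse onto centroids
print(intra_class_variance(spread, labels))  # samples scatter around centroids
```

Under this assumed measure, a model whose logits collapse onto a single point per class scores zero, while a model that keeps samples of the same class spread apart scores higher, which is the property the abstract associates with better teachers.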