Abstract: Knowledge distillation (KD) is a technique that transfers “dark knowledge” from a deep teacher network (teacher) to a shallow student network (student). Despite significant advances in KD, existing work has not adequately mined two crucial types of knowledge: 1) the knowledge of head categories, which represents the relationship between the target category and the categories most similar to it. Our findings reveal that this highly similar (complex) knowledge is essential for improving the student’s performance; and 2) the knowledge of tail categories, which is not used effectively. Existing studies often treat the non-target categories collectively, without sufficiently considering how useful the knowledge from tail categories is. To tackle these challenges, we reformulate classical KD (ReKD) into two components: Top-K Inter-class Similar Distillation (TISD) and Non-Top-K Inter-class Discriminability (NTID). Firstly, TISD captures the knowledge of head categories and imparts it to the student. Our experimental results verify that TISD is particularly effective in transferring the knowledge of head categories, even on fine-grained classification datasets. Secondly, we theoretically show that the weighting coefficient of NTID decreases as the Top-K probability increases, leading to stronger suppression of knowledge transfer for tail categories. This observation explains why difficult samples, whose Top-K probability is low, are more informative than simple ones. To better utilize both types of knowledge, we optimize TISD and NTID with separate weighting coefficients, thereby enhancing the student’s ability to learn valuable knowledge from both head and tail categories. Furthermore, our extensive experimental results demonstrate that ReKD achieves state-of-the-art performance on various image classification datasets, including CIFAR-100, Tiny-ImageNet, and ImageNet-1K, as well as on object detection and instance segmentation using the MS-COCO dataset.
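The split described in the abstract can be illustrated with a small numerical sketch. It is not the paper's exact TISD/NTID formulation, which the abstract does not give; it is the generic identity obtained by partitioning the KD loss KL(teacher || student) into the teacher's Top-K classes and the remaining classes. The function names (`softmax`, `kl`, `split_kd_loss`) and the choice of a "head" term that includes the aggregate tail mass are assumptions made for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a distillation temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def split_kd_loss(teacher_logits, student_logits, k=2, temperature=1.0):
    """Split KL(teacher || student) into a Top-K part and a non-Top-K part.

    Returns (head, tail, tail_weight) such that
        KL(teacher || student) == head + tail_weight * tail.
    Hypothetical helper, assuming 0 < k < number of classes.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # Rank classes by teacher probability; the first k are the "head".
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top, rest = order[:k], order[k:]
    p_rest = sum(p[i] for i in rest)  # teacher mass outside the Top-K
    q_rest = sum(q[i] for i in rest)  # student mass outside the Top-K
    # Head term: Top-K classes plus one bin for the aggregate tail mass.
    head = sum(p[i] * math.log(p[i] / q[i]) for i in top)
    head += p_rest * math.log(p_rest / q_rest)
    # Tail term: KL between the tail distributions renormalised to sum to 1.
    p_hat = [p[i] / p_rest for i in rest]
    q_hat = [q[i] / q_rest for i in rest]
    tail = kl(p_hat, q_hat)
    # The tail term is scaled by (1 - Top-K probability) of the teacher.
    return head, tail, p_rest

if __name__ == "__main__":
    teacher = [4.0, 3.5, 1.0, 0.5, 0.2]
    student = [2.0, 2.5, 1.5, 0.3, 0.1]
    head, tail, weight = split_kd_loss(teacher, student, k=2)
    print("head:", head, "tail:", tail, "tail weight:", weight)
    print("recombined:", head + weight * tail)
    print("direct KL :", kl(softmax(teacher), softmax(student)))
```

In this sketch the tail term is multiplied by one minus the teacher's Top-K probability, so a sample on which the teacher is confident passes little tail knowledge to the student. Replacing that implicit factor with a freely chosen coefficient is one way to read the abstract's "different weighting coefficients".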

Improving Knowledge Distillation Via Head and Tail Categories
