Abstract: Knowledge distillation (KD) is a technique that transfers “dark knowledge” from a deep teacher network (teacher) to a shallow student network (student). Despite significant advances in KD, existing work has not adequately mined two crucial types of knowledge: 1) the knowledge of head categories, which represents the relationship between the target category and its similar categories; our findings reveal that this highly similar (complex) knowledge is essential for improving the student’s performance; and 2) the knowledge of tail categories, which is not utilized effectively, because existing studies often treat the non-target categories collectively without sufficiently considering the effectiveness of knowledge from tail categories. To tackle these challenges, we reformulate classical KD (ReKD) into two components: Top-K Inter-class Similar Distillation (TISD) and Non-Top-K Inter-class Discriminability (NTID). First, TISD captures the knowledge of head categories and imparts it to the student. Our experimental results verify that TISD is particularly effective in transferring the knowledge of head categories, even in fine-grained dataset classification. Second, we theoretically show that the weighting coefficient of NTID increases with the probability of Top-K, leading to stronger suppression of knowledge transfer for tail categories. This observation explains why difficult samples are more informative than simple ones. To better utilize both types of knowledge, we optimize TISD and NTID with different weighting coefficients, thereby enhancing the student’s ability to learn this valuable knowledge from both head and tail categories. Extensive experiments demonstrate that ReKD achieves state-of-the-art performance on various image classification datasets, including CIFAR-100, Tiny-ImageNet, and ImageNet-1K, as well as on object detection and instance segmentation with the MS-COCO dataset.
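To make the head/tail split concrete, the following is a minimal sketch in plain Python of how a classical KD loss (KL divergence between softened teacher and student distributions) can be partitioned into a Top-K part and a non-Top-K part. It uses only the standard chain rule of KL divergence and is not taken from the paper: the function name `split_kd_loss`, the parameters `k` and `temperature`, and the returned keys are all hypothetical, and the exact TISD and NTID definitions and weighting coefficients used by ReKD may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def split_kd_loss(teacher_logits, student_logits, k=2, temperature=4.0):
    """Split KL(teacher || student) into Top-K and non-Top-K parts.

    Assumption: the Top-K set is chosen by the teacher's probabilities,
    and 0 < k < number of classes. This is an illustrative decomposition,
    not the paper's exact formulation.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution

    # Indices of the k classes the teacher considers most likely (head),
    # and the remaining classes (tail).
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top, rest = order[:k], order[k:]

    p_top, q_top = sum(p[i] for i in top), sum(q[i] for i in top)
    p_rest, q_rest = sum(p[i] for i in rest), sum(q[i] for i in rest)

    # KL between the two-bin distributions [Top-K mass, non-Top-K mass].
    mass_kl = kl([p_top, p_rest], [q_top, q_rest])
    # KL inside each group, after renormalising within the group.
    top_kl = kl([p[i] / p_top for i in top], [q[i] / q_top for i in top])
    rest_kl = kl([p[i] / p_rest for i in rest], [q[i] / q_rest for i in rest])

    return {
        "full_kl": kl(p, q),
        "mass_kl": mass_kl,
        "top_kl": top_kl,
        "rest_kl": rest_kl,
        "top_mass": p_top,
        "rest_mass": p_rest,
        # Chain rule: the weighted parts add back up to the full KL.
        "recomposed": mass_kl + p_top * top_kl + p_rest * rest_kl,
    }


if __name__ == "__main__":
    parts = split_kd_loss([5.0, 3.0, 1.0, 0.5, -1.0],
                          [4.0, 3.5, 0.0, 1.0, -0.5])
    for name, value in parts.items():
        print(f"{name}: {value:.6f}")
```

In this generic decomposition the non-Top-K term is scaled by the teacher's non-Top-K mass, so its contribution shrinks when the teacher is confident about its Top-K classes. Replacing the fixed factors `top_mass` and `rest_mass` with independent coefficients is one way to weight the two parts separately, in the spirit of the separate weighting described above.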

Boosting Knowledge Distillation Via Intra-class Logit Distribution Smoothing

Revisiting Knowledge Distillation Via Label Smoothing Regularization

Adaptive Explicit Knowledge Transfer for Knowledge Distillation

Class-aware Information for Logit-based Knowledge Distillation

Extending Label Smoothing Regularization with Self-Knowledge Distillation

Improve Knowledge Distillation via Label Revision and Data Selection

Lightweight Self-Knowledge Distillation with Multi-source Information Fusion

Dynamic Knowledge Distillation for Pre-trained Language Models

Scale Decoupled Distillation

Knowledge Distillation with Refined Logits

An Embarrassingly Simple Approach for Knowledge Distillation

Improving Knowledge Distillation With a Customized Teacher

Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation

Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models

Improving Knowledge Distillation Via Head and Tail Categories

Parameter-Efficient and Student-Friendly Knowledge Distillation

From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Student-friendly Knowledge Distillation

NTCE-KD: Non-Target-Class-Enhanced Knowledge Distillation

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Faculty Distillation with Optimal Transport