Abstract: Knowledge distillation (KD) transfers “dark knowledge” from a deep teacher network (teacher) to a shallow student network (student). Despite significant advances in KD, existing work has not adequately mined two crucial types of knowledge: 1) the knowledge of head categories, which represents the relationship between the target category and the categories most similar to it; and 2) the knowledge of tail categories, which is not yet used effectively. On the first point, our findings reveal that this highly similar (complex) knowledge is essential for improving the student’s performance. On the second, existing studies often treat the non-target categories collectively, without sufficiently considering how effective the knowledge from tail categories is. To tackle these challenges, we reformulate classical KD (ReKD) into two components: Top-K Inter-class Similar Distillation (TISD) and Non-Top-K Inter-class Discriminability (NTID). First, TISD captures the knowledge of head categories and imparts it to the student. Our experimental results verify that TISD is particularly effective at transferring this knowledge, even on fine-grained classification datasets. Second, we show theoretically that the weighting coefficient of NTID increases with the probability of Top-K, leading to stronger suppression of knowledge transfer for tail categories. This observation explains why difficult samples are more informative than simple ones. To better utilize both types of knowledge, we optimize TISD and NTID with different weighting coefficients, thereby enhancing the student’s ability to learn from both head and tail categories. Extensive experiments demonstrate that ReKD achieves state-of-the-art performance on image classification datasets, including CIFAR-100, Tiny-ImageNet, and ImageNet-1K, as well as on object detection and instance segmentation with the MS-COCO dataset.
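The abstract does not give the ReKD formulas, so the following is only a minimal sketch of the general idea of splitting a KD loss at the teacher's top-K classes. It uses the standard chain rule for KL divergence and is not the authors' TISD/NTID definition. The names `split_kd_loss`, `head_term`, `tail_term`, `alpha`, and `beta` are hypothetical, as are the default values of `k` and `temperature`. In this generic identity the tail term is scaled by the teacher's probability mass outside the top-K; the coefficient used in the paper may be defined differently.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL(p || q) for two discrete distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def split_kd_loss(teacher_logits, student_logits, k=2, temperature=4.0,
                  alpha=1.0, beta=1.0):
    # Hypothetical helper: splits KL(teacher || student) at the teacher's top-k.
    assert 0 < k < len(teacher_logits), "need a non-empty head and tail"
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)

    # Rank classes by teacher probability; the first k form the "head".
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    head, tail = order[:k], order[k:]
    p_tail = sum(p[i] for i in tail)
    q_tail = sum(q[i] for i in tail)

    # Head term: KL over the k head classes plus one pooled tail bucket.
    head_term = kl([p[i] for i in head] + [p_tail],
                   [q[i] for i in head] + [q_tail])
    # Tail term: KL between the tail distributions renormalized to sum to 1.
    tail_term = kl([p[i] / p_tail for i in tail],
                   [q[i] / q_tail for i in tail])

    # With alpha = beta = 1 this equals the ordinary KL(p || q) exactly;
    # other values reweight the two parts independently (an assumption here).
    total = alpha * head_term + beta * p_tail * tail_term
    return total, head_term, tail_term, p_tail
```

With `alpha = beta = 1` the returned total matches the plain KL divergence between the softened teacher and student distributions, so any change in behavior comes only from choosing different weights for the two parts.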

Self-Bidirectional Decoupled Distillation for Time Series Classification

DCCD: Reducing Neural Network Redundancy Via Distillation

Tolerant Self-Distillation for Image Classification

Densely Knowledge-Aware Network for Multivariate Time Series Classification

SelfMatch: Robust semisupervised time‐series classification with self‐distillation

An Efficient Federated Distillation Learning System for Multitask Time Series Classification

Tree-like Decision Distillation

Knowledge transfer via distillation from time and frequency domain for time series classification

Improving Knowledge Distillation Via Head and Tail Categories

Self Supervision to Distillation for Long-Tailed Visual Recognition

One‐stage self‐distillation guided knowledge transfer for long‐tailed visual recognition

Self-Distillation from the Last Mini-Batch for Consistency Regularization

Self-Knowledge Distillation via Progressive Associative Learning

Knowledge distillation-driven semi-supervised multi-view classification

Adversarial Self-Supervised Data-Free Distillation for Text Classification

Distilling Object Detectors via Decoupled Features

Lightweight Self-Knowledge Distillation with Multi-source Information Fusion

Self-supervised Knowledge Distillation Using Singular Value Decomposition

Self-knowledge distillation via dropout

Bidirectional Distillation for Top-K Recommender System