Abstract: Knowledge distillation (KD) is a model compression technique that exploits a large, well-trained {\tt teacher} neural network to train a small {\tt student} network. Treating {\tt teacher}'s features as knowledge, prevailing methods train {\tt student} by aligning its features with {\tt teacher}'s, e.g., by minimizing the KL-divergence or L2-distance between their (logit) features. While it is natural to assume that better feature alignment helps distill {\tt teacher}'s knowledge, simply forcing this alignment does not directly contribute to {\tt student}'s performance, e.g., classification accuracy. For example, minimizing the L2 distance between the penultimate-layer features (used to compute logits for classification) does not necessarily help learn a better {\tt student} classifier. We are therefore motivated to regularize {\tt student} features at the penultimate layer using {\tt teacher}, with the goal of training a better {\tt student} classifier. Specifically, we present a simple method that uses {\tt teacher}'s class-mean features to align {\tt student} features w.r.t. their \emph{direction}. Experiments show that this significantly improves KD performance. Moreover, we empirically find that {\tt student} produces features with notably smaller norms than {\tt teacher}'s, motivating us to regularize {\tt student} to produce large-norm features. Experiments show that doing so also yields better performance. Finally, as our main technical contribution, we present a simple loss that regularizes {\tt student} by simultaneously (1) aligning the \emph{direction} of its features with the {\tt teacher} class-mean feature, and (2) encouraging it to produce large-\emph{norm} features.
Experiments on standard benchmarks demonstrate that adopting our technique remarkably improves existing KD methods, achieving state-of-the-art KD performance on image classification (ImageNet and CIFAR100) and object detection (COCO).
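
The abstract does not spell out the loss, so the following is only a minimal sketch of one way a single term could cover both goals. It assumes the regularizer is the negative projection of each {\tt student} penultimate-layer feature onto the unit-length {\tt teacher} class-mean feature of its ground-truth class; since that projection equals the feature norm times the cosine of the angle to the class mean, it grows with both better direction alignment and larger norm. The function names `class_means` and `direction_norm_loss` are hypothetical, and the exact form, normalization, and weighting used in the paper may differ.

```python
import math


def class_means(teacher_features, labels):
    """Average the teacher features of each class (hypothetical helper)."""
    sums, counts = {}, {}
    for feat, label in zip(teacher_features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], feat)]
        counts[label] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def direction_norm_loss(student_features, labels, means, eps=1e-12):
    """Negative mean projection onto the unit teacher class-mean direction.

    Assumed form, not taken from the paper: for student feature f with
    class mean c, the per-sample term is -(f . c) / ||c||, which equals
    -||f|| * cos(angle between f and c).
    """
    total = 0.0
    for feat, label in zip(student_features, labels):
        mean = means[label]
        mean_norm = math.sqrt(sum(v * v for v in mean)) + eps
        total += sum(a * b for a, b in zip(feat, mean)) / mean_norm
    return -total / len(student_features)


# Toy data: two classes in a 2-D feature space.
teacher_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
teacher_labels = [0, 0, 1]
means = class_means(teacher_feats, teacher_labels)

aligned_small = direction_norm_loss([[1.0, 0.0]], [0], means)
aligned_large = direction_norm_loss([[2.0, 0.0]], [0], means)
misaligned = direction_norm_loss([[0.0, 1.0]], [0], means)
```

In this toy setting, a feature that points along the class mean receives a lower loss than an orthogonal one, and doubling its norm lowers the loss further. In practice such a term would be added, with some weight, to an existing KD objective rather than used alone.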

Improving knowledge distillation via an expressive teacher

DCCD: Reducing Neural Network Redundancy Via Distillation

Improving Knowledge Distillation With a Customized Teacher

Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching

Collaborative Knowledge Distillation

Improving Knowledge Distillation Via Head and Tail Categories

Revisiting Knowledge Distillation Via Label Smoothing Regularization

Interactive Knowledge Distillation for image classification

Improving Knowledge Distillation Via Regularizing Feature Direction and Norm

Improving Knowledge Distillation with Teacher's Explanation

An Embarrassingly Simple Approach for Knowledge Distillation

Rethinking Knowledge Distillation Via Cross-Entropy

Teacher-student collaborative knowledge distillation for image classification

TC$^3$KD: Knowledge distillation via teacher-student cooperative curriculum customization

Why does Knowledge Distillation work? Rethink its attention and fidelity mechanism

Fixing the Teacher-Student Knowledge Discrepancy in Distillation

Knowledge Condensation Distillation

Knowledge Distillation Via Channel Correlation Structure

Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning

Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation