Abstract: Knowledge distillation (KD) is a model compression technique that exploits a large, well-trained {\tt teacher} neural network to train a small {\tt student} network. Treating {\tt teacher}'s features as knowledge, prevailing methods train {\tt student} by aligning its features with the {\tt teacher}'s, e.g., by minimizing the KL-divergence or L2-distance between their (logits) features. While it is natural to assume that better feature alignment helps distill {\tt teacher}'s knowledge, simply forcing this alignment does not directly contribute to the {\tt student}'s performance, e.g., classification accuracy. For example, minimizing the L2 distance between the penultimate-layer features (used to compute logits for classification) does not necessarily help learn a better {\tt student} classifier. We are therefore motivated to use {\tt teacher} to regularize {\tt student} features at the penultimate layer, with the goal of training a better {\tt student} classifier. Specifically, we present a simple method that uses {\tt teacher}'s class-mean features to align {\tt student} features w.r.t. their \emph{direction}. Experiments show that this significantly improves KD performance. Moreover, we empirically find that {\tt student} produces features that have notably smaller norms than {\tt teacher}'s, motivating us to regularize {\tt student} to produce large-norm features. Experiments show that doing so also yields better performance. Finally, as our main technical contribution, we present a simple loss that regularizes {\tt student} by simultaneously (1) aligning the \emph{direction} of its features with the {\tt teacher} class-mean feature, and (2) encouraging it to produce large-\emph{norm} features.
Experiments on standard benchmarks demonstrate that adopting our technique remarkably improves existing KD methods, achieving state-of-the-art KD performance on image classification (ImageNet and CIFAR100 datasets) and object detection (the COCO dataset).
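
The two ingredients named in the abstract, direction alignment with {\tt teacher} class-mean features and a preference for large-norm {\tt student} features, can be illustrated with a small sketch. It is not the loss from the paper, whose exact form the abstract does not give. It assumes cosine similarity as the direction measure and a plain negative-norm term as the norm reward. The function names and the `norm_weight` parameter are hypothetical.

```python
import math


def class_means(features, labels):
    """Average the teacher's penultimate-layer features per class label."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def l2_norm(vec):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(x * x for x in vec))


def direction_norm_loss(student_feats, labels, teacher_means, norm_weight=0.1):
    """Illustrative regularizer (an assumption, not the paper's formula).

    Per sample: (1 - cosine(student feature, teacher class mean))
                - norm_weight * ||student feature||
    The first term is small when the directions agree; the second term
    rewards larger student feature norms.
    """
    total = 0.0
    for feat, label in zip(student_feats, labels):
        mean = teacher_means[label]
        feat_norm = l2_norm(feat)
        dot = sum(a * b for a, b in zip(feat, mean))
        cosine = dot / (feat_norm * l2_norm(mean) + 1e-12)  # epsilon avoids 0/0
        total += (1.0 - cosine) - norm_weight * feat_norm
    return total / len(student_feats)


# Toy 2-D features with two classes (made-up numbers for illustration only).
teacher_feats = [[1, 0], [3, 0], [0, 2], [0, 4]]
teacher_labels = [0, 0, 1, 1]
means = class_means(teacher_feats, teacher_labels)

aligned = direction_norm_loss([[1, 0], [0, 1]], [0, 1], means)
misaligned = direction_norm_loss([[0, 1], [1, 0]], [0, 1], means)
larger_norm = direction_norm_loss([[2, 0], [0, 2]], [0, 1], means)
```

In this toy setup, `misaligned` is larger than `aligned`, and `larger_norm` is smaller than `aligned`, matching the intent of the two terms. Because the norm term as written is unbounded below, a real training setup would presumably add it to a task loss with a small weight or cap it; that choice is left open here.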

Improving Knowledge Distillation Via Regularizing Feature Direction and Norm

DCCD: Reducing Neural Network Redundancy Via Distillation

Revisiting Knowledge Distillation Via Label Smoothing Regularization

Improving knowledge distillation via an expressive teacher

An Embarrassingly Simple Approach for Knowledge Distillation

From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching

Simplified Knowledge Distillation for Deep Neural Networks Bridging the Performance Gap with a Novel Teacher–Student Architecture

Rethinking Knowledge Distillation Via Cross-Entropy

Improving Knowledge Distillation With a Customized Teacher

Improving Knowledge Distillation Via Head and Tail Categories

Why does Knowledge Distillation work? Rethink its attention and fidelity mechanism

Improving Knowledge Distillation with Teacher's Explanation

Boosting Knowledge Distillation Via Intra-class Logit Distribution Smoothing

Online Knowledge Distillation via Collaborative Learning

Enhancement of Knowledge Distillation via Non-Linear Feature Alignment

Knowledge Distillation Performs Partial Variance Reduction

Knowledge Distillation Via Channel Correlation Structure

Distilling Knowledge by Mimicking Features

Student-friendly Knowledge Distillation