Abstract: Knowledge distillation (KD) is a model compression technique that exploits a large, well-trained {\tt teacher} neural network to train a small {\tt student} network. Treating the {\tt teacher}'s features as knowledge, prevailing methods train the {\tt student} by aligning its features with the {\tt teacher}'s, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their features. While it is natural to assume that better feature alignment helps distill the {\tt teacher}'s knowledge, simply forcing this alignment does not directly contribute to the {\tt student}'s performance, e.g., classification accuracy. For example, minimizing the L2 distance between the penultimate-layer features (used to compute logits for classification) does not necessarily help learn a better {\tt student} classifier. We are therefore motivated to regularize the {\tt student}'s penultimate-layer features using the {\tt teacher}, with the goal of training a better {\tt student} classifier. Specifically, we present a simple method that uses the {\tt teacher}'s class-mean features to align {\tt student} features with respect to their {\em direction}. Experiments show that this significantly improves KD performance. Moreover, we empirically find that the {\tt student} produces features with notably smaller norms than the {\tt teacher}'s, motivating us to regularize the {\tt student} to produce large-norm features. Experiments show that doing so also yields better performance. Finally, as our main technical contribution, we present a simple loss that regularizes the {\tt student} by simultaneously (1) aligning the \emph{direction} of its features with the {\tt teacher}'s class-mean feature, and (2) encouraging it to produce large-\emph{norm} features.
Experiments on standard benchmarks demonstrate that adopting our technique substantially improves existing KD methods, achieving state-of-the-art KD performance on image classification (ImageNet and CIFAR100) and object detection (COCO).
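
The two regularization goals described above (direction alignment with the teacher's class-mean feature, and large feature norm) can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the formulation below is an assumption for illustration only: it penalizes the negative projection of each student feature onto the unit-length teacher class-mean direction, which equals the feature norm times the cosine similarity. The names `class_means`, `vector_norm`, and `direction_norm_loss` are hypothetical, and plain Python lists stand in for penultimate-layer feature tensors.

```python
import math


def class_means(features, labels):
    """Average the (teacher) feature vectors belonging to each class."""
    sums, counts = {}, {}
    for feature, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feature)
            counts[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], feature)]
        counts[label] += 1
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}


def vector_norm(vector):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def direction_norm_loss(student_features, labels, teacher_means, eps=1e-12):
    """Assumed illustrative loss, not necessarily the paper's exact form.

    For each sample, project the student feature onto the unit vector of
    the teacher's class-mean feature. The projection equals
    ||f_s|| * cos(f_s, mean), so minimizing its negative rewards both
    direction alignment and a large feature norm.
    """
    total = 0.0
    for feature, label in zip(student_features, labels):
        mean = teacher_means[label]
        scale = max(vector_norm(mean), eps)  # guard against a zero mean
        unit = [m / scale for m in mean]
        total += sum(a * b for a, b in zip(feature, unit))
    return -total / len(student_features)


# Toy data: two classes with 2-D features (values are made up).
teacher_features = [[2.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
labels = [0, 0, 1]
means = class_means(teacher_features, labels)

aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # same direction, small norm
aligned_large = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]  # same direction, larger norm
misaligned = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]     # orthogonal to class means

loss_aligned = direction_norm_loss(aligned, labels, means)
loss_aligned_large = direction_norm_loss(aligned_large, labels, means)
loss_misaligned = direction_norm_loss(misaligned, labels, means)
```

On this toy data the loss is lowest for the aligned, larger-norm features and highest for the misaligned ones. In practice such a term would presumably be added to an existing KD objective with a weighting coefficient; that weighting is a tuning choice the abstract does not specify.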

Enhancement of Knowledge Distillation via Non-Linear Feature Alignment

DCCD: Reducing Neural Network Redundancy Via Distillation

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment

Reinforced Multi-Teacher Selection for Knowledge Distillation

Knowledge Augmentation for Distillation: A General and Effective Approach to Enhance Knowledge Distillation

SAKD: Sparse attention knowledge distillation

One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation

CDFKD-MFS: Collaborative Data-free Knowledge Distillation Via Multi-level Feature Sharing

Improving Knowledge Distillation Via Regularizing Feature Direction and Norm

Ability-aware knowledge distillation for resource-constrained embedded devices

Knowledge Fusion Distillation: Improving Distillation with Multi-scale Attention Mechanisms

Attention and feature transfer based knowledge distillation

Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

Improved Knowledge Distillation via Adversarial Collaboration

Multistage feature fusion knowledge distillation

Highlight Every Step: Knowledge Distillation via Collaborative Teaching

Lightweight Self-Knowledge Distillation with Multi-source Information Fusion

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Comparative Knowledge Distillation

Dynamic Knowledge Distillation for Pre-trained Language Models