Abstract: Deep neural models have achieved remarkable performance on various supervised and unsupervised learning tasks, but deploying these large networks on resource-limited devices remains a challenge. As a representative model compression and acceleration method, knowledge distillation (KD) addresses this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks and ignore the information redundancy of student networks. In this article, we propose a novel distillation framework, difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks to reduce redundancy. At the feature level, we construct an efficient contrastive objective that broadens the student network's feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from teacher networks by taking the difference between multiview augmented responses of the same instance, which makes student networks more sensitive to minor dynamic changes. With these two improvements, the student network gains contrastive and difference knowledge and reduces its overfitting and redundancy. Notably, the student approaches and even outperforms the teacher in test accuracy on CIFAR-100. We reduce the top-1 error to 28.16% on ImageNet classification and to 24.15% for cross-model transfer with ResNet-18. Experiments and ablation studies on popular datasets show that the proposed method achieves state-of-the-art accuracy compared with other distillation methods.
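
The abstract does not state the loss formulas, so the following is only an illustrative sketch of the two output-level ideas it mentions: imitating teacher responses (shown here as the widely used temperature-softened KL divergence) and matching the difference between responses to two augmented views of the same instance. The function names, the temperature value, and the squared-error form of the difference term are assumptions made for illustration, not the DCCD objective itself.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Response imitation: KL between softened teacher and student outputs.

    The T^2 factor is the conventional gradient-scale correction.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def difference_loss(teacher_a, teacher_b, student_a, student_b):
    """Hypothetical difference term (an assumption, not the paper's formula).

    Compares how the teacher's response changes between two augmented views
    (a, b) of one instance with how the student's response changes, using
    mean squared error.
    """
    t_diff = [a - b for a, b in zip(teacher_a, teacher_b)]
    s_diff = [a - b for a, b in zip(student_a, student_b)]
    return sum((t - s) ** 2 for t, s in zip(t_diff, s_diff)) / len(t_diff)


if __name__ == "__main__":
    teacher_a, teacher_b = [2.0, 0.5, -1.0], [1.8, 0.7, -1.1]
    student_a, student_b = [1.5, 0.2, -0.5], [1.5, 0.3, -0.6]
    print("KD loss:", kd_loss(teacher_a, student_a))
    print("Difference loss:", difference_loss(teacher_a, teacher_b, student_a, student_b))
```

In this sketch the difference term is zero whenever the student's change between views equals the teacher's change, even if the raw responses differ, which is one way a student could be made sensitive to small view-to-view variations rather than only to absolute outputs.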


DCCD: Reducing Neural Network Redundancy Via Distillation
