Abstract: Deep neural models have achieved remarkable performance on various supervised and unsupervised learning tasks, but deploying these large networks on resource-limited devices remains a challenge. As a representative model compression and acceleration method, knowledge distillation (KD) addresses this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks and ignore the information redundancy of student networks. In this article, we propose a novel distillation framework, difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks to reduce redundancy. At the feature level, we construct an efficient contrastive objective that broadens the student network's feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from the teacher network by taking the difference between multiview augmented responses of the same instance, which makes the student network more sensitive to minor dynamic changes. With these two improvements, the student network gains contrastive and difference knowledge, and its overfitting and redundancy are reduced. As a result, the student approaches and even outperforms the teacher in test accuracy on CIFAR-100. We reduce the top-1 error to 28.16% on ImageNet classification and to 24.15% for cross-model transfer with ResNet-18. Empirical experiments and ablation studies on popular datasets show that the proposed method achieves state-of-the-art accuracy compared with other distillation methods.
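
The two knowledge terms described in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract gives no equations, so the InfoNCE-style channel loss, the temperature `tau`, the mean-squared-error difference loss, and the function names `channel_contrastive_loss` and `difference_loss` are all assumptions chosen for illustration. Each channel is treated as a flattened feature vector, and the student channel at index i is assumed to be the positive pair of the teacher channel at the same index.

```python
import math


def _normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def channel_contrastive_loss(student, teacher, tau=0.1):
    """Assumed InfoNCE-style loss over channels.

    student, teacher: lists of channels, each a flattened feature vector.
    Student channel i is pulled toward teacher channel i (positive) and
    pushed away from teacher channels j != i (negatives).
    """
    s = [_normalize(c) for c in student]
    t = [_normalize(c) for c in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        logits = [_dot(s_i, t_j) / tau for t_j in t]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(s)


def difference_loss(s_view1, s_view2, t_view1, t_view2):
    """Assumed MSE between student and teacher response differences.

    Each argument is a response (logit) vector for one augmented view of
    the same instance. Only the change between views is matched.
    """
    s_diff = [a - b for a, b in zip(s_view1, s_view2)]
    t_diff = [a - b for a, b in zip(t_view1, t_view2)]
    return sum((a - b) ** 2 for a, b in zip(s_diff, t_diff)) / len(s_diff)
```

Under these assumptions, the channel loss is small when each student channel aligns with its own teacher channel and distinct channels stay dissimilar, which is one way to read "broadens the feature expression space". The difference loss ignores any constant offset shared by both views and penalizes only a mismatch in how the responses change between views.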

Knowledge Distillation with Attention for Deep Transfer Learning of Convolutional Networks

DCCD: Reducing Neural Network Redundancy Via Distillation

Deep Transfer Learning Method Using Self-Pixel and Global Channel Attentive Regularization

Pay Attention to Convolution Filters: Towards Fast and Accurate Fine-Grained Transfer Learning

Class Attention Transfer Based Knowledge Distillation

Hierarchical Multi-Attention Transfer for Knowledge Distillation

SAKD: Sparse attention knowledge distillation

Sparse Deep Transfer Learning for Convolutional Neural Network

Attention and feature transfer based knowledge distillation

Online Knowledge Distillation via Collaborative Learning

Progressive Network Grafting for Few-Shot Knowledge Distillation

Graph-based Knowledge Distillation by Multi-head Attention Network

Gated Transfer Network for Transfer Learning

Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

Knowledge Distillation via the Target-aware Transformer

Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching

Effective Domain Knowledge Transfer with Soft Fine-tuning

Knowledge Distillation Meets Self-Supervision

DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers

Towards Making Deep Transfer Learning Never Hurt

Channel Distillation: Channel-Wise Attention for Knowledge Distillation