Abstract: Deep neural models have achieved remarkable performance on various supervised and unsupervised learning tasks, but deploying these large networks on resource-limited devices remains a challenge. As a representative model compression and acceleration method, knowledge distillation (KD) addresses this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks and ignore the information redundancy of student networks. In this article, we propose a novel distillation framework, difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks for redundancy reduction. At the feature level, we construct an efficient contrastive objective that broadens the student network's feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from teacher networks by taking the difference between multiview augmented responses of the same instance, which makes student networks more sensitive to minor dynamic changes. With these two components, the student network gains contrastive and difference knowledge and reduces its overfitting and redundancy. As a result, the student approaches and even outperforms the teacher in test accuracy on CIFAR-100. We reduce the top-1 error to 28.16% on ImageNet classification and 24.15% for cross-model transfer with ResNet-18. Experiments and ablation studies on popular datasets show that the proposed method can achieve state-of-the-art accuracy compared with other distillation methods.
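The abstract names two loss components without giving their formulas, so the following is only a rough sketch of how they might look, not the authors' formulation. It assumes (hypothetically) that the channel contrastive objective is an InfoNCE-style loss in which student channel c is pulled toward teacher channel c and pushed away from the other teacher channels, and that the difference knowledge is matched with a mean squared error between the student's and the teacher's change in softened response across two augmented views. The function names, the temperatures, and the MSE choice are all assumptions made for illustration.

```python
import numpy as np


def _log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def channel_contrastive_loss(f_student, f_teacher, temperature=0.1):
    """Hypothetical channel-level InfoNCE loss.

    f_student, f_teacher: arrays of shape (C, D), one flattened feature
    map per channel. Student channel c is treated as the positive for
    teacher channel c; all other teacher channels act as negatives.
    """
    s = f_student / (np.linalg.norm(f_student, axis=1, keepdims=True) + 1e-8)
    t = f_teacher / (np.linalg.norm(f_teacher, axis=1, keepdims=True) + 1e-8)
    sim = s @ t.T / temperature            # (C, C) cosine similarities
    log_prob = _log_softmax(sim, axis=1)   # row c: student c vs. all teacher channels
    return float(-np.mean(np.diag(log_prob)))


def difference_loss(s_view1, s_view2, t_view1, t_view2, tau=4.0):
    """Hypothetical difference-knowledge loss.

    Each argument holds logits of shape (N, K) for one augmented view.
    The student's change in softened response between the two views is
    matched to the teacher's change (MSE is an assumed choice).
    """
    def soft(z):
        return np.exp(_log_softmax(z / tau, axis=-1))

    d_student = soft(s_view1) - soft(s_view2)
    d_teacher = soft(t_view1) - soft(t_view2)
    return float(np.mean((d_student - d_teacher) ** 2))
```

Under these assumptions, the first term rewards student channels that are each aligned with a distinct teacher channel, which is one way to discourage redundant channels, and the second term is zero only when the student reacts to the change of view exactly as the teacher does.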

Knowledge Distillation with Source-free Unsupervised Domain Adaptation for BERT Model Compression

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

MCKD: Mutually Collaborative Knowledge Distillation for Federated Domain Adaptation and Generalization

Learning to Augment for Data-scarce Domain BERT Knowledge Distillation

DCCD: Reducing Neural Network Redundancy Via Distillation

[Knowledge about genotype-phenotype of the diseases should be coming into pediatrician's horizon].

Patient Knowledge Distillation for BERT Model Compression

Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

Distilling Universal and Joint Knowledge for Cross-Domain Model Compression on Time Series Data

AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression

Comparative analysis of strategies of knowledge distillation on BERT for text matching

LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression

Effects of Combined Resistance and Endurance Training Versus Resistance Training Alone on Strength, Exercise Capacity, and Quality of Life in Patients With COPD

Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains

Cross-domain knowledge distillation for text classification

Heterogeneous Student Knowledge Distillation From BERT Using a Lightweight Ensemble Framework

Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation

MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

Building a Multi-domain Neural Machine Translation Model using Knowledge Distillation