Abstract: Deep neural models have achieved remarkable performance on various supervised and unsupervised learning tasks, but deploying these large networks on resource-limited devices remains a challenge. As a representative model compression and acceleration method, knowledge distillation (KD) addresses this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks and ignore the information redundancy of student networks. In this article, we propose a novel distillation framework, difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks to reduce redundancy. At the feature level, we construct an efficient contrastive objective that broadens the student network's feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from teacher networks by taking the difference between multiview augmented responses of the same instance, which makes student networks more sensitive to minor dynamic changes. With these two components, the student network gains contrastive and difference knowledge and reduces its overfitting and redundancy. On CIFAR-100, the student approaches and even outperforms the teacher in test accuracy. We reduce the top-1 error to 28.16% on ImageNet classification and 24.15% for cross-model transfer with ResNet-18. Experiments and ablation studies on popular datasets show that the proposed method achieves state-of-the-art accuracy compared with other distillation methods.
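The abstract names two ingredients without giving formulas: a channel-level contrastive objective and a difference between multiview responses. The sketch below illustrates one plausible reading of each and is not the authors' formulation. The function names, the InfoNCE form of the channel loss, the squared-error matching of response differences, and the temperature values are all assumptions made for illustration. It also assumes the student and teacher feature maps have already been projected to the same number of channels and the same flattened length.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def difference_loss(t_view1, t_view2, s_view1, s_view2, temperature=4.0):
    """Assumed form of 'difference knowledge': match the change in softened
    responses between two augmented views of the same instance."""
    t_p1, t_p2 = softmax(t_view1, temperature), softmax(t_view2, temperature)
    s_p1, s_p2 = softmax(s_view1, temperature), softmax(s_view2, temperature)
    t_diff = [a - b for a, b in zip(t_p1, t_p2)]
    s_diff = [a - b for a, b in zip(s_p1, s_p2)]
    # Mean squared error between teacher and student view-to-view differences.
    return sum((t - s) ** 2 for t, s in zip(t_diff, s_diff)) / len(t_diff)

def _cosine(u, v):
    """Cosine similarity between two flattened channel activations."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def channel_contrastive_loss(student_channels, teacher_channels, tau=0.1):
    """Assumed InfoNCE over channels: student channel i is pulled toward
    teacher channel i (positive) and pushed from the other teacher channels."""
    total = 0.0
    for i, s in enumerate(student_channels):
        sims = [_cosine(s, t) / tau for t in teacher_channels]
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in sims))
        total += log_denominator - sims[i]
    return total / len(student_channels)
```

Under these assumptions, the difference loss is zero when the student reproduces the teacher's response to both views, and the channel loss is small when each student channel aligns with its own teacher channel and stays decorrelated from the rest, which is one way to read "redundancy reduction".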

Compression Models via Meta-Learning and Structured Distillation for Named Entity Recognition

XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

DCCD: Reducing Neural Network Redundancy Via Distillation

Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains

WIDER & CLOSER: Mixture of Short-channel Distillers for Zero-shot Cross-lingual Named Entity Recognition

UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition

Towards Better Entity Linking with Multi-View Enhanced Distillation

MetaDistiller: Network Self-Boosting Via Meta-Learned Top-Down Distillation

Meta-Learning Adaptive Knowledge Distillation for Efficient Biomedical Natural Language Processing

Reinforced Multi-Teacher Selection for Knowledge Distillation

LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition

Teacher-Free Knowledge Distillation Based on Non-Progressive Meta-Learned Multi Ranking Selection

Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble

DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy

Named Entity Recognition Via Noise Aware Training Mechanism with Data Filter

Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning

Multi-Teacher Distillation With Single Model for Neural Machine Translation

Decomposed Meta-Learning for Few-Shot Named Entity Recognition

MedNER: Enhanced Named Entity Recognition in Medical Corpus via Optimized Balanced and Deep Active Learning