Abstract: Deep neural models have achieved remarkable performance on a range of supervised and unsupervised learning tasks, but deploying these large networks on resource-limited devices remains a challenge. As a representative class of model compression and acceleration methods, knowledge distillation (KD) addresses this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks and ignore the information redundancy of student networks. In this article, we propose a novel distillation framework, difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks to reduce redundancy. At the feature level, we construct an efficient contrastive objective that broadens the student network's feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from teacher networks by taking the difference between multiview augmented responses of the same instance, which makes student networks more sensitive to minor dynamic changes. With these two components, the student network gains contrastive and difference knowledge and reduces its overfitting and redundancy. As a result, the student approaches and even outperforms the teacher in test accuracy on CIFAR-100. We reduce the top-1 error to 28.16% on ImageNet classification and to 24.15% for cross-model transfer with ResNet-18. Experiments and ablation studies on popular datasets show that the proposed method achieves state-of-the-art accuracy compared with other distillation methods.
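
The abstract does not state the loss functions, so the following is only a rough sketch of how the two ideas might look in code, not the authors' formulation. It assumes an InfoNCE-style objective over channels at the feature level and a mean-squared match between student and teacher response differences across two augmented views at the output level. The function names, the temperatures, and the choice of NumPy are all hypothetical.

```python
import numpy as np


def softmax(logits, tau=1.0):
    """Row-wise softmax with temperature tau (numerically stabilized)."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def channel_contrastive_loss(f_s, f_t, temperature=0.1, eps=1e-12):
    """Assumed InfoNCE-style loss over channels of one instance.

    f_s, f_t: arrays of shape (C, D), one flattened feature vector per channel.
    Student channel i is treated as the positive for teacher channel i;
    all other teacher channels act as negatives.
    """
    f_s = f_s / (np.linalg.norm(f_s, axis=1, keepdims=True) + eps)
    f_t = f_t / (np.linalg.norm(f_t, axis=1, keepdims=True) + eps)
    logits = f_s @ f_t.T / temperature              # (C, C) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def difference_loss(s_v1, s_v2, t_v1, t_v2, tau=4.0):
    """Assumed difference-knowledge loss.

    s_v1, s_v2: student logits for two augmented views, shape (N, K).
    t_v1, t_v2: teacher logits for the same two views, shape (N, K).
    The student is asked to reproduce how the teacher's softened response
    changes between the two views of the same instance.
    """
    d_s = softmax(s_v1, tau) - softmax(s_v2, tau)
    d_t = softmax(t_v1, tau) - softmax(t_v2, tau)
    return float(np.mean((d_s - d_t) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_t = rng.normal(size=(8, 16))
    feats_s = feats_t + 0.1 * rng.normal(size=(8, 16))
    print("channel contrastive loss:", channel_contrastive_loss(feats_s, feats_t))

    t1, t2 = rng.normal(size=(4, 10)), rng.normal(size=(4, 10))
    s1, s2 = rng.normal(size=(4, 10)), rng.normal(size=(4, 10))
    print("difference loss:", difference_loss(s1, s2, t1, t2))
```

In a training loop, terms like these would presumably be added with weighting coefficients to a standard KD loss and the cross-entropy loss; those weights are not given in the abstract.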

Memory Efficient Data-Free Distillation for Continual Learning

Progressive Learning without Forgetting

Variational Data-Free Knowledge Distillation for Continual Learning

DCCD: Reducing Neural Network Redundancy Via Distillation

TARGET: Federated Class-Continual Learning Via Exemplar-Free Distillation

Densely Distilling Cumulative Knowledge for Continual Learning

Online Distillation with Continual Learning for Cyclic Domain Shifts

Online Continual Learning with Declarative Memory

Adaptively Integrated Knowledge Distillation and Prediction Uncertainty for Continual Learning

Continual Learning With Knowledge Distillation: A Survey

Centroid Distance Distillation for Effective Rehearsal in Continual Learning

Reducing catastrophic forgetting of incremental learning in the absence of rehearsal memory with task-specific token

Continual Federated Learning Based on Knowledge Distillation

Reducing Catastrophic Forgetting in Online Class Incremental Learning Using Self-Distillation

Layerwise Optimization by Gradient Decomposition for Continual Learning

Memory Recall: A Simple Neural Network Training Framework Against Catastrophic Forgetting

Prototype-Sample Relation Distillation: Towards Replay-Free Continual Learning

Distilling Causal Effect of Data in Class-Incremental Learning

Overcoming Catastrophic Forgetting with Unlabeled Data in the Wild

Data-Distortion Guided Self-Distillation for Deep Neural Networks