Abstract:Deep neural models have achieved remarkable performance on various supervised and unsupervised learning tasks, but it is a challenge to deploy these large-size networks on resource-limited devices. As a representative type of model compression and acceleration methods, knowledge distillation (KD) solves this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks but ignore the information redundancy of student networks. In this article, we propose a novel distillation framework difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks for redundancy reduction. At the feature level, we construct an efficient contrastive objective that broadens student networks' feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from teacher networks by making a difference between multiview augmented responses of the same instance. We enhance student networks to be more sensitive to minor dynamic changes. With the improvement of two aspects of DCCD, the student network gains contrastive and difference knowledge and reduces its overfitting and redundancy. Finally, we achieve surprising results that the student approaches and even outperforms the teacher in test accuracy on CIFAR-100. We reduce the top-1 error to 28.16% on ImageNet classification and 24.15% for cross-model transfer with ResNet-18. Empirical experiments and ablation studies on popular datasets show that our proposed method can achieve state-of-the-art accuracy compared with other distillation methods.

Adversarial Contrastive Distillation with Adaptive Denoising

DCCD: Reducing Neural Network Redundancy Via Distillation

Out of Thin Air: Exploring Data-Free Adversarial Robustness Distillation

Revisiting Adversarial Robustness Distillation: Robust Soft Labels Make Student Better

Data-Free Adversarial Distillation

Feature Distillation With Guided Adversarial Contrastive Learning

LDN-RC: a Lightweight Denoising Network with Residual Connection to Improve Adversarial Robustness

Adversarially Robust Distillation

Mitigating Accuracy-Robustness Trade-off via Balanced Multi-Teacher Adversarial Distillation

Distilling Adversarial Robustness Using Heterogeneous Teachers

Improving Adversarial Robust Fairness via Anti-Bias Soft Label Distillation

Denoising Autoencoder-based Defensive Distillation as an Adversarial Robustness Algorithm

How and When Adversarial Robustness Transfers in Knowledge Distillation?

DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation

Dynamic Label Adversarial Training for Deep Learning Robustness Against Adversarial Attacks

BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation

Improving adversarial robustness using knowledge distillation guided by attention information bottleneck

Indirect Gradient Matching for Adversarial Robust Distillation

Complementary Relation Contrastive Distillation

LTD: Low Temperature Distillation for Robust Adversarial Training