Abstract: Deep neural models have achieved remarkable performance on various supervised and unsupervised learning tasks, but deploying these large networks on resource-limited devices remains a challenge. As a representative model compression and acceleration method, knowledge distillation (KD) addresses this problem by transferring knowledge from heavy teachers to lightweight students. However, most distillation methods focus on imitating the responses of teacher networks and ignore the information redundancy of student networks. In this article, we propose a novel distillation framework, difference-based channel contrastive distillation (DCCD), which introduces channel contrastive knowledge and dynamic difference knowledge into student networks for redundancy reduction. At the feature level, we construct an efficient contrastive objective that broadens the student network's feature expression space and preserves richer information in the feature extraction stage. At the final output level, more detailed knowledge is extracted from the teacher network by taking the difference between multiview augmented responses of the same instance, which makes the student network more sensitive to minor dynamic changes. With these two components of DCCD, the student network gains contrastive and difference knowledge and reduces its overfitting and redundancy. Notably, the student approaches and even outperforms the teacher in test accuracy on CIFAR-100. We reduce the top-1 error to 28.16% on ImageNet classification and 24.15% for cross-model transfer with ResNet-18. Empirical experiments and ablation studies on popular datasets show that our proposed method achieves state-of-the-art accuracy compared with other distillation methods.
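The abstract names two kinds of knowledge but does not give their formulas, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the channel contrastive objective can be approximated by an InfoNCE-style loss in which student channel i is pulled toward teacher channel i and pushed away from the other teacher channels, and that the difference knowledge can be approximated by matching the student's logit difference between two augmented views to the teacher's. The function names, the temperature `tau`, and the squared-error form are all hypothetical.

```python
import math

def _normalize(v):
    # L2-normalize one flattened channel; fall back to 1.0 for an all-zero channel.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def _log_softmax(row):
    # Numerically stable log-softmax over a list of floats.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def channel_contrastive_loss(f_s, f_t, tau=0.1):
    """Assumed InfoNCE-style loss over channels.

    f_s, f_t: lists of C flattened channels (student, teacher), equal lengths.
    Positive pair: student channel i with teacher channel i (an assumption).
    """
    s = [_normalize(c) for c in f_s]
    t = [_normalize(c) for c in f_t]
    loss = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of student channel i to every teacher channel.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / tau for t_j in t]
        loss -= _log_softmax(logits)[i]
    return loss / len(s)

def difference_loss(z_s_a, z_s_b, z_t_a, z_t_b):
    """Assumed squared-error match between student and teacher logit differences.

    Each argument is a list of K logits for augmented view a or b of one instance.
    """
    d_s = [a - b for a, b in zip(z_s_a, z_s_b)]
    d_t = [a - b for a, b in zip(z_t_a, z_t_b)]
    return sum((x - y) ** 2 for x, y in zip(d_s, d_t)) / len(d_s)
```

In this sketch, the contrastive term is low when each student channel aligns with its own teacher channel and differs from the rest, which is one way to discourage redundant channels; the difference term vanishes when the student reacts to a change of view exactly as the teacher does.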

Sharing Residual Units Through Collective Tensor Factorization in Deep Neural Networks

Parameters Sharing in Residual Neural Networks

DCCD: Reducing Neural Network Redundancy Via Distillation

Residual Feature-Reutilization Inception Network

ShaResNet: reducing residual network parameter number by sharing weights

Multi-Residual Networks: Improving the Speed and Accuracy of Residual Networks

When Residual Learning Meets Dense Aggregation: Rethinking the Aggregation of Deep Neural Networks

Residual encoding framework to compress DNN parameters for fast transfer

DRGCN: Dynamic Evolving Initial Residual for Deep Graph Convolutional Networks

Using accumulation to optimize deep residual neural nets

Double reuses based residual network

Deep Convolutional Neural Networks with Merge-and-Run Mappings

Residual Feature Aggregation Network for Image Super-Resolution

Residual Networks of Residual Networks: Multilevel Residual Networks

Competitive Inner-Imaging Squeeze and Excitation for Residual Network

DCRNN: A Deep Cross approach based on RNN for Partial Parameter Sharing in Multi-task Learning

Res2Net: A New Multi-Scale Backbone Architecture

BlockDrop: Dynamic Inference Paths in Residual Networks

Improved Residual Networks for Image and Video Recognition

Image Super-Resolution Using Aggregated Residual Transformation Networks With Spatial Attention

DR-Net: An Improved Network for Building Extraction from High Resolution Remote Sensing Image