Abstract: Model compression methods are important to allow for easier deployment of deep learning models in compute-, memory-, and energy-constrained environments such as mobile phones. Knowledge distillation is a class of model compression algorithms in which knowledge from a large teacher network is transferred to a smaller student network, thereby improving the student's performance. In this paper, we show how optimal transport-based loss functions can be used to train a student network; these losses encourage the student to learn parameters that bring the distribution of student features closer to that of the teacher features. We present image classification results on CIFAR-100, SVHN and ImageNet and show that the proposed optimal transport loss functions perform comparably to or better than other loss functions.

What problem does this paper attempt to address?

This paper addresses the compute, memory, and energy costs of deploying deep learning models in resource-constrained environments. Specifically, the authors propose a new loss function based on optimal transport (OT) for knowledge distillation (KD), which improves the performance of a small student network and brings it closer to that of a large teacher network. With this method, model size and computational requirements can be reduced substantially without a large loss in accuracy, making deep learning models easier to deploy in resource-constrained environments such as mobile devices.

### Background and Problem Description of the Paper

Deep convolutional neural networks (CNNs) perform well in many computer vision tasks, such as image classification and object detection. However, these models are usually computationally intensive and have high memory usage, which limits their use in resource-constrained environments such as mobile phones and drones. To overcome this challenge, researchers have developed a variety of model compression techniques, one of which is knowledge distillation. Knowledge distillation transfers the knowledge of a large teacher network to a small student network, improving the student's accuracy while keeping its size and computational requirements low.

### Proposed Method

The authors propose a new knowledge distillation method that trains the student network with a loss function based on optimal transport. Optimal transport is a principled way to compare two distributions, even when their supports do not overlap.
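To make the point about non-overlapping supports concrete, here is a minimal sketch in plain Python. It is an illustration only, not the authors' loss: it assumes one-dimensional samples of equal size, where the optimal transport plan simply matches sorted values, and the function name `wasserstein_1d` and the sample values are hypothetical. A divergence such as KL would be infinite or undefined for both "student" samples below, whereas the transport cost still reports which one is closer to the teacher.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    Assumption for this sketch: both samples have the same number of
    points with uniform weights, so the optimal plan pairs sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical 1-D "features"; neither student sample overlaps the teacher.
teacher = [0.0, 1.0, 2.0]
near_student = [3.0, 4.0, 5.0]
far_student = [10.0, 11.0, 12.0]

near = wasserstein_1d(teacher, near_student)  # 3.0
far = wasserstein_1d(teacher, far_student)    # 10.0
print(near, far)
```

The distance grows smoothly as the student sample moves away from the teacher sample, which is what makes it usable as a training signal even when the two feature distributions start out far apart.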
The authors design new loss functions that encourage the student network to learn parameters that bring the distribution of student features closer to the distribution of teacher features. Minimizing the optimal transport cost moves the student's features geometrically closer to the teacher's features.

### Experimental Results

The authors ran experiments on several datasets, including CIFAR-100, ImageNet, and SVHN, to evaluate the proposed optimal transport loss functions. According to the reported results, the optimal transport-based losses improve student performance comparably to or better than other existing loss functions. The improvement is larger when they are combined with the traditional knowledge distillation loss (KD loss) and with contrastive losses such as the CRD loss.

### Conclusion

This paper proposes a new loss function based on optimal transport for knowledge distillation, with the goal of improving the performance of a small student network. The experiments indicate that the method improves student accuracy on multiple datasets while keeping model complexity low, which suits resource-constrained environments. Future research directions include developing "contrastive" optimal transport loss functions and exploring faster methods for comparing distributions.
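As a rough illustration of the feature-matching idea summarized under Proposed Method, the sketch below computes an entropy-regularized optimal transport cost between a batch of student features and a batch of teacher features using Sinkhorn iterations. Several things here are assumptions rather than details taken from the paper: the choice of the Sinkhorn solver, the squared Euclidean ground cost, uniform weights on the samples, and the names `sinkhorn_cost`, `eps`, and `n_iters`. It is written in plain Python for readability; a training loop would use a differentiable tensor library instead.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sinkhorn_cost(student, teacher, eps=0.5, n_iters=200):
    """Approximate OT cost between two feature batches (illustrative only).

    eps is the entropic regularization strength (hypothetical default).
    Very large costs relative to eps can underflow exp(); a log-domain
    variant would be needed for that case.
    """
    n, m = len(student), len(teacher)
    cost = [[sq_dist(s, t) for t in teacher] for s in student]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = 1.0 / n, 1.0 / m  # uniform mass on every sample
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P[i][j] = u[i] * K[i][j] * v[j]; return <P, cost>.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


# Hypothetical 2-D features for a batch of two samples.
teacher_feats = [[0.0, 0.0], [1.0, 0.0]]
close_student = [[0.1, 0.0], [0.9, 0.0]]
far_student = [[2.0, 2.0], [3.0, 2.0]]

close_cost = sinkhorn_cost(close_student, teacher_feats)
far_cost = sinkhorn_cost(far_student, teacher_feats)
print(close_cost < far_cost)  # True: closer features give a lower cost
```

In a distillation setup, a term of this kind would be added to the usual classification loss, so that gradient steps which lower the transport cost pull the student's feature distribution toward the teacher's.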

Model Compression Using Optimal Transport
