Model Compression Using Optimal Transport

Suhas Lohit, Michael Jones
DOI: https://doi.org/10.48550/arXiv.2012.03907
2020-12-08
Abstract: Model compression methods are important to allow for easier deployment of deep learning models in compute, memory and energy-constrained environments such as mobile phones. Knowledge distillation is a class of model compression algorithm where knowledge from a large teacher network is transferred to a smaller student network thereby improving the student's performance. In this paper, we show how optimal transport-based loss functions can be used for training a student network which encourages learning student network parameters that help bring the distribution of student features closer to that of the teacher features. We present image classification results on CIFAR-100, SVHN and ImageNet and show that the proposed optimal transport loss functions perform comparably to or better than other loss functions.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the compute, memory, and energy costs of deploying deep-learning models in resource-constrained environments. Specifically, the authors propose new loss functions based on optimal transport (OT) for knowledge distillation (KD), with the aim of bringing the performance of a small student network closer to that of a large teacher network. This lets the model's size and computational requirements be reduced substantially with little loss in accuracy, making deep-learning models easier to deploy on devices such as mobile phones.

### Background and Problem Description of the Paper

Deep convolutional neural networks (CNNs) perform well in many computer vision tasks, such as image classification and object detection. However, these models are usually computationally intensive and memory-hungry, which limits their use in resource-constrained environments such as mobile phones and drones. To overcome this, researchers have developed a variety of model compression techniques, of which knowledge distillation is one effective example. Knowledge distillation transfers the knowledge of a large teacher network to a small student network, improving the student's accuracy while keeping its size and computational requirements low.

### Proposed Method

The authors propose a knowledge distillation method that trains the student network with a loss function based on optimal transport. Optimal transport is a principled way to compare two distributions, even when their supports do not overlap.
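
To make the idea of comparing two feature distributions with optimal transport concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: it assumes uniform weights on each feature vector, a squared Euclidean ground cost, and an entropy-regularized solver (log-domain Sinkhorn iterations). The names `sinkhorn_cost`, `eps`, and `n_iters` are hypothetical, and this summary does not say which OT formulation or solver the authors actually use.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def squared_distance(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sinkhorn_cost(student_feats, teacher_feats, eps=0.1, n_iters=200):
    """Approximate OT cost between two uniformly weighted feature batches.

    Assumption: entropy-regularized OT solved with log-domain Sinkhorn
    iterations; the paper's actual solver may differ.
    """
    n, m = len(student_feats), len(teacher_feats)
    cost = [[squared_distance(s, t) for t in teacher_feats]
            for s in student_feats]
    log_a = math.log(1.0 / n)  # uniform mass on each student feature
    log_b = math.log(1.0 / m)  # uniform mass on each teacher feature
    f = [0.0] * n  # dual potentials for student points
    g = [0.0] * m  # dual potentials for teacher points
    for _ in range(n_iters):
        # Alternately rescale so row sums, then column sums, match the masses.
        f = [eps * (log_a - logsumexp([(g[j] - cost[i][j]) / eps
                                       for j in range(m)]))
             for i in range(n)]
        g = [eps * (log_b - logsumexp([(f[i] - cost[i][j]) / eps
                                       for i in range(n)]))
             for j in range(m)]
    # Recover the transport plan from the potentials and sum its cost.
    total = 0.0
    for i in range(n):
        for j in range(m):
            mass = math.exp((f[i] + g[j] - cost[i][j]) / eps)
            total += mass * cost[i][j]
    return total


# Toy "features": the far teacher batch shares no points with the student.
student = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
teacher_same = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
teacher_far = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]

cost_same = sinkhorn_cost(student, teacher_same)
cost_far = sinkhorn_cost(student, teacher_far)
print(cost_same, cost_far)
```

With these toy inputs, the cost for the identical batch is close to zero, while the shifted batch gives a much larger but still finite cost even though the two point sets share no support. In a distillation setting, a differentiable version of such a cost would be computed on student and teacher features and minimized with respect to the student's parameters; plain Python lists are used here only to keep the sketch dependency-free.
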
The authors design new loss functions that encourage the student network to learn parameters that bring the distribution of student features closer to the distribution of teacher features. Minimizing the optimal transport cost pulls the student's features geometrically closer to the teacher's.

### Experimental Results

The authors ran image classification experiments on CIFAR-100, SVHN, and ImageNet to evaluate the proposed optimal transport loss functions. The results show that the optimal transport-based losses perform comparably to or better than other existing loss functions. The gains are larger when the OT loss is combined with the standard knowledge distillation (KD) loss and with contrastive losses such as CRD.

### Conclusion

This paper proposes new loss functions based on optimal transport for knowledge distillation, with the goal of improving the performance of a small student network. The experiments show that the method improves student accuracy on multiple datasets while keeping model complexity low, which suits resource-constrained environments. Future research directions include developing "contrastive" optimal transport loss functions and exploring faster distribution comparison methods.