Computer Vision Model Compression Techniques for Embedded Systems: A Survey

Alexandre Lopes, Fernando Pereira dos Santos, Diulhio de Oliveira, Mauricio Schiezaro, Helio Pedrini
DOI: https://doi.org/10.1016/j.cag.2024.104015
2024-08-16
Abstract: Deep neural networks have consistently represented the state of the art in most computer vision problems. In these scenarios, larger and more complex models have demonstrated superior performance to smaller architectures, especially when trained with plenty of representative data. With the recent adoption of Vision Transformer (ViT) based architectures and advanced Convolutional Neural Networks (CNNs), the total number of parameters of leading backbone architectures increased from 62M parameters in 2012 with AlexNet to 7B parameters in 2024 with AIM-7B. Consequently, deploying such deep architectures faces challenges in environments with processing and runtime constraints, particularly in embedded systems. This paper covers the main model compression techniques applied for computer vision tasks, enabling modern models to be used in embedded systems. We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique and expected variations when analyzing it on various embedded devices. We also share codes to assist researchers and new practitioners in overcoming initial implementation challenges for each subarea and present trends for Model Compression. Case studies for compression models are available at https://github.com/venturusbr/cv-model-compression.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the following problem: **how to compress deep neural networks so that they can be deployed on resource-constrained embedded systems**. As the deep neural networks used in computer vision grow larger and more complex (for example, from AlexNet with 62 million parameters in 2012 to AIM-7B with 7 billion parameters in 2024), deploying them on embedded devices with limited computing power, memory, and power budgets becomes difficult. To address this, the paper reviews the main model compression techniques for computer vision tasks:

1. **Knowledge Distillation**
   - Knowledge is transferred from a large teacher model to a small student model, so the student needs far fewer parameters while maintaining high performance.
   - Formula: \[ L_{KD} = -\sum_{i} \sigma_{i}\left(\frac{z_{t}}{T}\right) \log \sigma_{i}\left(\frac{z_{s}}{T}\right) \] where \( z_{t} \) and \( z_{s} \) are the outputs (logits) of the teacher and student models respectively, \( \sigma_{i} \) is the \( i \)-th component of the softmax, and \( T \) is the temperature parameter that controls the smoothness of the probability distribution.
2. **Network Pruning**
   - Unimportant weights or structures (such as filters or channels) are removed from the network, reducing model size and inference time.
   - Pruning is either unstructured (individual weights are removed) or structured (entire filters or channels are removed).
3. **Network Quantization**
   - Network parameters stored as floating-point numbers are converted to low-precision representations (such as 8-bit integers or binary values), reducing memory usage and increasing inference speed.
   - Example of the quantization step: \[ w_{\text{quantized}} = \mathrm{round}\left(\frac{w_{\text{float}}}{\Delta}\right) \] where \( w_{\text{float}} \) is the original floating-point weight and \( \Delta \) is the quantization step size.
4. **Low-Rank Matrix Factorization**
   - The weight matrices of the network are factorized to reduce the number of parameters, although this method is less commonly applied in computer vision.

The paper also discusses how to select the most suitable technique and analyzes how the performance of different compression techniques varies across embedded devices. In addition, the authors provide code examples to help researchers and newcomers overcome the initial challenges of implementing these techniques.
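The distillation loss, the quantization step, and unstructured pruning described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`softmax`, `kd_loss`, `quantize`, `dequantize`, `magnitude_prune`) are hypothetical, \( \sigma \) is assumed to be the softmax, and ranking weights by absolute value is one common importance criterion that the summary itself does not specify.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """L_KD = -sum_i sigma_i(z_t / T) * log sigma_i(z_s / T)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def quantize(weights, step):
    """w_quantized = round(w_float / step), stored as integers.

    Python's round() rounds exact halves to the nearest even integer.
    """
    return [round(w / step) for w in weights]


def dequantize(q_weights, step):
    """Map integer codes back to approximate floating-point weights."""
    return [q * step for q in q_weights]


def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction of weights
    with the smallest absolute value (assumed importance criterion)."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [0.5, 0.5, 0.5]
    print("KD loss:", kd_loss(teacher, student))
    codes = quantize([0.26, -0.74, 0.1], step=0.25)
    print("quantized:", codes, "->", dequantize(codes, step=0.25))
    print("pruned:", magnitude_prune([0.5, -0.05, 0.3, 0.01], sparsity=0.5))
```

With a uniform student distribution the loss reduces to \( \log 3 \) for three classes, and it is smallest when the student's softened distribution matches the teacher's. In practice this term is usually combined with a standard task loss, and the step size \( \Delta \) is chosen per tensor or per channel; both choices are outside this sketch.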