Deep Compression for PyTorch Model Deployment on Microcontrollers

Eren Dogan, H. Fatih Ugurdag, Hasan Unlu
DOI: https://doi.org/10.48550/arXiv.2103.15972
2021-03-30
Abstract: Neural network deployment on low-cost embedded systems, hence on microcontrollers (MCUs), has recently been attracting more attention than ever. Since MCUs have limited memory capacity as well as limited compute-speed, it is critical that we employ model compression, which reduces both memory and compute-speed requirements. In this paper, we add model compression, specifically Deep Compression, and further optimize Unlu's earlier work on arXiv, which efficiently deploys PyTorch models on MCUs. First, we prune the weights in convolutional and fully connected layers. Secondly, the remaining weights and activations are quantized to 8-bit integers from 32-bit floating-point. Finally, forward pass functions are compressed using special data structures for sparse matrices, which store only nonzero weights (without impacting performance and accuracy). In the case of the LeNet-5 model, the memory footprint was reduced by 12.45x, and the inference speed was boosted by 2.57x.
Machine Learning
What problem does this paper attempt to address?
The paper addresses how to deploy deep neural network models efficiently on microcontrollers (MCUs). Because MCUs have limited memory capacity and compute speed, deploying an unoptimized deep-learning model directly is very difficult. The paper therefore applies model compression, specifically Deep Compression, to reduce the model's memory footprint and computational requirements.

### Main problems:

1. **Memory limitations**: MCU memory is very limited and cannot hold large deep-learning models directly.
2. **Insufficient computing power**: MCUs compute slowly and cannot run complex neural network inference efficiently.
3. **Limitations of existing methods**: Earlier methods for deploying neural networks on MCUs leave room for improvement in compression rate and inference speed.

### Solutions:

To address these challenges, the paper combines Deep Compression with the existing deployment flow in the following steps:

1. **Pruning**: Unimportant weights in the convolutional and fully connected layers are removed, which reduces the number of model parameters. A binary search is used to find a sparsity level at which model accuracy does not decrease significantly.
2. **Quantization**: The remaining weights and the activations are quantized from 32-bit floating-point numbers to 8-bit integers, which further reduces model size and improves inference speed. Affine quantization is used for the output of each layer and scale quantization for the weights, in order to avoid accuracy loss.
3. **Sparse matrix storage**: The nonzero weights are stored in Compressed Sparse Column (CSC) format, so zero weights take up no storage and the memory footprint shrinks accordingly.
4. **Optimization of the forward-propagation algorithm**: The forward-pass routines for fully connected and convolutional layers are rewritten to operate directly on the sparse representation, so inference touches only the stored nonzero weights and runs faster.

### Experimental results:

The paper evaluates the method on the LeNet-5 model. After pruning and quantization, the model's memory footprint is reduced by 12.45x and inference speed is increased by 2.57x, while a high accuracy rate is maintained.

In conclusion, the paper aims to make deep-learning models run efficiently on resource-constrained microcontrollers through Deep Compression, thereby widening the range of applications for deep learning.
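The pruning step above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes magnitude-based pruning (the smallest weights are zeroed first), assumes that accuracy does not increase as sparsity grows, and omits any retraining. The names `magnitude_prune`, `search_sparsity`, `evaluate`, and `min_accuracy` are hypothetical and chosen only for this example.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed.

    Assumption: magnitude is used as the importance criterion.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def search_sparsity(weights, evaluate, min_accuracy, steps=10):
    """Binary search for a high sparsity that keeps accuracy >= min_accuracy.

    `evaluate` is a hypothetical callback that maps a pruned weight list to
    an accuracy score. The search assumes that score is non-increasing in
    sparsity; otherwise bisection is not guaranteed to find the best level.
    """
    lo, hi = 0.0, 1.0
    best = 0.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if evaluate(magnitude_prune(weights, mid)) >= min_accuracy:
            best, lo = mid, mid   # acceptable: try pruning more
        else:
            hi = mid              # too aggressive: prune less
    return best
```

In a real flow, `evaluate` would run the pruned model on a validation set, which makes each search step expensive; bisection keeps the number of such evaluations logarithmic in the desired resolution.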
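The two quantization schemes named in the quantization step can be sketched as follows. This uses the common textbook definitions rather than the paper's exact formulas: scale quantization maps a real value to `round(x / scale)` with the zero-point fixed at 0, while affine quantization adds an integer zero-point so that an asymmetric range (such as non-negative activations) uses the full 8-bit range. The integer ranges [-127, 127] and [-128, 127], the max-based calibration, and all function names are assumptions made for this illustration.

```python
def scale_quantize(values, bits=8):
    """Symmetric ("scale") quantization: real ~= scale * q, zero-point is 0."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def affine_quantize(values, bits=8):
    """Affine quantization: real ~= scale * (q - zero_point)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -128, 127
    # Extend the range to include 0 so that real zero is representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Map integers back to approximate real values."""
    return [scale * (x - zero_point) for x in q]
```

Keeping the weights' zero-point at 0 means a pruned weight stays exactly 0 after quantization, so it can still be dropped from the sparse representation.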
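To make the CSC storage and the sparse forward pass concrete, here is a small Python sketch of a fully connected layer computed directly from CSC arrays. It assumes the weight matrix has shape (outputs x inputs) and uses the standard three-array CSC layout (nonzero values, their row indices, and per-column start offsets); the paper's actual memory layout and its C routines may differ. The function names are hypothetical, and skipping zero inputs is an extra optimization added here, not something the summary attributes to the paper.

```python
def dense_to_csc(matrix):
    """Convert a dense matrix (list of rows) to CSC arrays.

    Returns (values, row_idx, col_ptr): the nonzeros of column j are
    values[col_ptr[j]:col_ptr[j + 1]], at rows row_idx[col_ptr[j]:col_ptr[j + 1]].
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            w = matrix[i][j]
            if w != 0:
                values.append(w)
                row_idx.append(i)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr


def sparse_linear(values, row_idx, col_ptr, x, n_out):
    """Compute y = W @ x using only the stored nonzero weights."""
    y = [0] * n_out
    for j, xj in enumerate(x):
        if xj == 0:
            continue  # a zero input contributes nothing; skip its column
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * xj
    return y


# Example: 4 outputs, 3 inputs, 4 nonzeros out of 12 weights.
W = [[0, 2, 0],
     [1, 0, 0],
     [0, 0, -3],
     [0, 4, 0]]
values, row_idx, col_ptr = dense_to_csc(W)
y = sparse_linear(values, row_idx, col_ptr, [1, 2, 3], n_out=4)
# values == [1, 2, 4, -3], row_idx == [1, 0, 3, 2], col_ptr == [0, 1, 3, 4]
# y == [4, 1, -9, 8]
```

The inner loop runs once per nonzero weight rather than once per matrix entry, which is where both the memory and the speed savings of a pruned layer come from.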