Abstract: Neural network deployment on low-cost embedded systems, and in particular on microcontrollers (MCUs), has recently been attracting more attention than ever. Since MCUs have limited memory capacity as well as limited compute speed, it is critical that we employ model compression, which reduces both memory and compute requirements. In this paper, we add model compression, specifically Deep Compression, to Unlu's earlier work on arXiv, which efficiently deploys PyTorch models on MCUs, and further optimize it. First, we prune the weights in convolutional and fully connected layers. Second, the remaining weights and activations are quantized from 32-bit floating-point to 8-bit integers. Finally, forward pass functions are compressed using special data structures for sparse matrices, which store only nonzero weights (without impacting performance and accuracy). In the case of the LeNet-5 model, the memory footprint was reduced by 12.45x, and the inference speed was boosted by 2.57x.
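
The quantization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common int8 formulations of scale (symmetric) quantization for weights and affine (asymmetric) quantization for activations; the function names and sample values are hypothetical and are not taken from the paper's code.

```python
def scale_quantize(values, num_bits=8):
    """Scale (symmetric) quantization: q = round(x / scale), zero-point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def affine_quantize(values, num_bits=8):
    """Affine (asymmetric) quantization: q = round(x / scale) + zero_point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1   # -128..127
    # Extend the range so that 0.0 is always representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Map integers back to approximate real values."""
    return [(qi - zero_point) * scale for qi in q]


# Hypothetical sample data, for illustration only.
weights = [-0.50, 0.0, 0.25, 0.10]
activations = [0.0, 0.8, 2.4, 1.2]

qw, w_scale = scale_quantize(weights)
qa, a_scale, a_zp = affine_quantize(activations)

print(qw, w_scale)
print(qa, a_scale, a_zp)
```

Each quantized value fits in one byte instead of four, which is where the 4x storage saving of this step comes from; the rounding error per value is bounded by half the scale as long as no clamping occurs.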

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to efficiently deploy deep neural network models on microcontrollers (MCUs). Specifically, due to the limited memory capacity and computing speed of microcontrollers, it is very difficult to directly deploy unoptimized deep-learning models. To solve this problem, the paper adopts model compression techniques, especially "Deep Compression", to reduce the memory footprint and computational requirements of the model.

### Main problems:

1. **Memory limitations**: The memory of microcontrollers is very limited and cannot directly support large-scale deep-learning models.
2. **Insufficient computing power**: The computing speed of microcontrollers is slow and cannot efficiently run complex neural network inference tasks.
3. **Limitations of existing methods**: Although previous studies have proposed methods for deploying neural networks on microcontrollers, these methods still have room for improvement in terms of compression rate and inference speed.

### Solutions:

To address the above challenges, the paper proposes an optimization method combined with deep-compression techniques, which mainly includes the following steps:

1. **Pruning**: By removing unimportant weights in convolutional layers and fully-connected layers, the number of model parameters is reduced. The specific implementation of pruning uses a binary-search algorithm to find the optimal sparsity to ensure that the model accuracy does not significantly decrease.
2. **Quantization**: The remaining weights and activation values are quantized from 32-bit floating-point numbers to 8-bit integers, further reducing the model size and improving the inference speed. During the quantization process, affine quantization is adopted for the output of each layer, and scale quantization is adopted for the weights to prevent accuracy loss.
3. **Sparse matrix storage**: The non-zero weights are stored in the Compressed Sparse Column (CSC) format to avoid storing redundant data, thereby effectively reducing the memory footprint.
4. **Optimization of the forward-propagation algorithm**: According to the characteristics of sparse matrices, the forward-propagation algorithms of fully-connected layers and convolutional layers are improved, so that sparse matrices can be processed more efficiently during the inference process, thereby improving the inference speed.

### Experimental results:

Through experimental verification of the LeNet-5 model, the paper demonstrates the effectiveness of the proposed method. Specifically, after pruning and quantization, the memory footprint of the model is reduced by 12.45 times, the inference speed is increased by 2.57 times, and a high accuracy rate is maintained at the same time.

In conclusion, this paper aims to enable deep-learning models to run efficiently on resource-constrained microcontrollers through deep-compression techniques, thereby expanding the application scenarios of deep learning.
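
The pruning, CSC storage, and sparse forward-pass steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: magnitude-based pruning is assumed as the pruning criterion, the binary search assumes accuracy does not increase as sparsity grows, and all function names and the sample matrix are hypothetical.

```python
def magnitude_prune(matrix, sparsity):
    """Zero the smallest-magnitude entries so roughly `sparsity` of them are zero."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)            # number of weights to drop
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]                  # ties at the threshold are also dropped
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def find_max_sparsity(accuracy_at, min_accuracy, steps=8):
    """Binary-search the largest sparsity whose accuracy stays >= min_accuracy.

    `accuracy_at` is a hypothetical callback that prunes, evaluates, and
    returns accuracy; it is assumed to be non-increasing in sparsity.
    """
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if accuracy_at(mid) >= min_accuracy:
            lo = mid                         # still accurate enough: prune more
        else:
            hi = mid                         # too lossy: prune less
    return lo


def to_csc(matrix):
    """Store only non-zero entries, column by column (Compressed Sparse Column)."""
    values, row_idx, col_ptr = [], [], [0]
    n_rows, n_cols = len(matrix), len(matrix[0])
    for c in range(n_cols):
        for r in range(n_rows):
            if matrix[r][c] != 0:
                values.append(matrix[r][c])
                row_idx.append(r)
        col_ptr.append(len(values))          # end of column c in `values`
    return values, row_idx, col_ptr, n_rows


def csc_matvec(values, row_idx, col_ptr, n_rows, x):
    """Fully-connected forward pass y = W x that touches only stored weights."""
    y = [0.0] * n_rows
    for c, xc in enumerate(x):
        if xc == 0:
            continue                         # a zero input skips its whole column
        for k in range(col_ptr[c], col_ptr[c + 1]):
            y[row_idx[k]] += values[k] * xc
    return y


# Hypothetical 3x4 weight matrix (3 outputs, 4 inputs), for illustration only.
W = [[0.90, -0.05, 0.00, 0.40],
     [0.02, 0.70, -0.60, 0.01],
     [-0.03, 0.00, 0.08, -0.80]]
x = [1.0, 2.0, 0.0, -1.0]

pruned = magnitude_prune(W, sparsity=0.5)
values, row_idx, col_ptr, n_rows = to_csc(pruned)
y = csc_matvec(values, row_idx, col_ptr, n_rows, x)
print(values, row_idx, col_ptr)
print(y)
```

With the column-oriented layout, the inner loop runs once per stored weight, so the work scales with the number of non-zeros rather than with the full matrix size. On an MCU the three arrays would typically be emitted as constant C arrays; that code-generation step is not shown here.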

Deep Compression for PyTorch Model Deployment on Microcontrollers
