MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han

2024-04-03

Abstract: Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8x. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%), and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the memory limitation of running tiny deep learning models on microcontroller units (MCUs). Existing methods such as pruning, quantization, and neural architecture search focus on reducing parameter count and computation, but they do not solve the memory bottleneck. The authors find that the bottleneck comes from the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks use an order of magnitude more memory than the rest of the network. To address this, the paper proposes two strategies: 1) patch-by-patch inference scheduling, which executes the initial stage of the CNN on one small spatial region of the feature map at a time and so significantly reduces peak memory; 2) network (receptive field) redistribution, which shifts the receptive field and FLOPs from the initial stage to the later stages, reducing the computation overhead caused by overlapping patches. Because redistributing the receptive field by hand is difficult, the authors automate it with neural architecture search that jointly optimizes the network architecture and the inference schedule, resulting in MCUNetV2. Patch-based inference reduces the peak memory usage of existing networks by 4-8x while maintaining model accuracy. MCUNetV2 achieves over 90% accuracy on the visual wake words dataset with only 32kB of SRAM and sets a record ImageNet accuracy on an MCU (71.8%). It also makes object detection feasible on tiny devices, achieving 16.9% higher mAP on Pascal VOC than the previous state-of-the-art result. In conclusion, the paper largely addresses the memory bottleneck in tinyML and paves the way for vision applications beyond image classification.
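The patch-by-patch idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy "initial stage" (a stack of single-channel valid 3x3 convolutions), the function names `conv3x3`, `initial_stage`, and `patch_based`, and the memory proxy (largest input region held at once) are all assumptions made for illustration. The sketch shows that computing each output tile from an input tile enlarged by a halo gives the same result as whole-image inference, while the overlapping halos cause some input pixels to be processed more than once.

```python
import numpy as np

def conv3x3(x, w):
    """Valid 3x3 single-channel cross-correlation on a 2D array."""
    h, wd = x.shape
    out = np.zeros((h - 2, wd - 2), dtype=x.dtype)
    for i in range(3):
        for j in range(3):
            out += w[i, j] * x[i:i + h - 2, j:j + wd - 2]
    return out

def initial_stage(x, weights):
    """Toy stand-in for the memory-heavy first blocks of a CNN."""
    for w in weights:
        x = conv3x3(x, w)
    return x

def patch_based(x, weights, n_patches):
    """Run the stage tile by tile over an n_patches x n_patches output grid.

    Returns the stitched output, the largest input tile held at once
    (a crude proxy for peak activation memory), and the total number of
    input elements touched (which exceeds x.size because halos overlap).
    """
    halo = 2 * len(weights)  # each valid 3x3 conv shrinks each axis by 2
    out_h, out_w = x.shape[0] - halo, x.shape[1] - halo
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    rows = np.linspace(0, out_h, n_patches + 1).astype(int)
    cols = np.linspace(0, out_w, n_patches + 1).astype(int)
    peak, touched = 0, 0
    for r0, r1 in zip(rows[:-1], rows[1:]):
        for c0, c1 in zip(cols[:-1], cols[1:]):
            # Input region needed for output tile [r0:r1, c0:c1].
            tile = x[r0:r1 + halo, c0:c1 + halo]
            peak = max(peak, tile.size)
            touched += tile.size
            out[r0:r1, c0:c1] = initial_stage(tile, weights)
    return out, peak, touched

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
weights = [rng.standard_normal((3, 3)) for _ in range(3)]

full = initial_stage(x, weights)
patched, peak, touched = patch_based(x, weights, n_patches=2)

print("outputs match:", np.allclose(full, patched))
print("largest input held at once:", peak, "vs whole image:", x.size)
print("input elements touched / image size:", touched / x.size)
```

In this toy setting the halo grows with the number of layers run per patch, which is why moving receptive field out of the patch-based stage (the redistribution strategy in the summary) would shrink the overlap and the repeated work.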

MCUNet: Tiny Deep Learning on IoT Devices

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

MicroENet: an Efficient Network for MCUs with Low Model Parameters and Peak Memory

MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory

TinyAD: Memory-efficient anomaly detection for time series data in Industrial IoT

SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers

A Novel Memory-Scheduling Strategy for Large Convolutional Neural Network on Memory-Limited Devices

Neural networks on microcontrollers: saving memory at inference via operator reordering

Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks

Enabling Large Neural Networks on Tiny Microcontrollers with Swapping

Value-Driven Mixed-Precision Quantization for Patch-Based Inference on Microcontrollers

MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

iMCU: A 28-nm Digital In-Memory Computing-Based Microcontroller Unit for TinyML

On-Device Training Under 256KB Memory

Efficient Memory Management for Deep Neural Net Inference

UDC: Unified DNAS for Compressible TinyML Models

DeepPicarMicro: Applying TinyML to Autonomous Cyber Physical Systems

Efficient Neural Network Deployment for Microcontroller

DSORT-MCU: Detecting Small Objects in Real-Time on Microcontroller Units