Abstract: In recent years, the need for efficient deployment of Neural Networks (NN) on edge devices has been steadily increasing. However, the high computational demand of Machine Learning (ML) inference prevents a direct software deployment on tiny, resource-constrained, microcontroller-based IoT devices. Therefore, various custom and application-specific NN hardware accelerators have been proposed to enable real-time ML inference on low-power and resource-limited edge devices. Efficient mapping of the computational load onto hardware and software resources is a key challenge for improving performance while maintaining low power consumption and a small area footprint. Hardware acceleration makes it possible to build embedded processors that are both high-performance and low-power. This paper presents an efficient hardware-software framework to accelerate ML inference on edge devices using a modified TensorFlow Lite for Microcontrollers (TFLM) model running on a Microcontroller (MCU) and a dedicated Neural Processing Unit (NPU) custom hardware accelerator, referred to as MCU-NPU. The proposed framework supports weight compression of pruned, quantized NN models and exploits the sparsity of the pruned model to further reduce computational complexity. The proposed methodology has been evaluated by applying the MCU-NPU acceleration to various TFLM-based NN architectures using the widely used MLPerf Tiny benchmark. Experimental results demonstrate a significant speedup of up to 724x compared to a pure software implementation. For example, the runtime for CIFAR-10 classification is reduced from about 20 s to only 37 ms using the proposed hardware acceleration. Moreover, the proposed hardware accelerator outperforms all the reference models optimized for edge devices in terms of inference runtime.
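
To make concrete the idea of compressing pruned, quantized weights and exploiting their sparsity, the following is a minimal sketch in Python with NumPy. It is not the paper's compression format or NPU datapath, which the abstract does not describe. The storage scheme (nonzero int8 values plus flat indices), the function names `compress_weights` and `sparse_dense_layer`, and the random 75% pruning mask are all illustrative assumptions.

```python
import numpy as np

def compress_weights(w_int8):
    """Keep only the nonzero int8 weights and their flat indices (assumed format)."""
    flat = w_int8.ravel()
    idx = np.flatnonzero(flat)
    return flat[idx].astype(np.int8), idx.astype(np.int32), w_int8.shape

def sparse_dense_layer(values, idx, shape, x_int8):
    """Compute y = W @ x using one multiply-accumulate per stored nonzero weight."""
    rows, cols = shape
    y = np.zeros(rows, dtype=np.int32)       # int32 accumulators, as usual for int8 MACs
    r, c = idx // cols, idx % cols           # recover row/column of each nonzero weight
    np.add.at(y, r, values.astype(np.int32) * x_int8[c].astype(np.int32))
    return y

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=(8, 16)).astype(np.int8)
w[rng.random(w.shape) < 0.75] = 0            # stand-in for pruning: zero ~75% of weights
x = rng.integers(-128, 128, size=16).astype(np.int8)

values, idx, shape = compress_weights(w)
y_sparse = sparse_dense_layer(values, idx, shape, x)
y_dense = w.astype(np.int32) @ x.astype(np.int32)

macs_dense = w.size                          # MACs for a dense fully connected layer
macs_sparse = values.size                    # MACs when zero weights are skipped
print(np.array_equal(y_sparse, y_dense), macs_dense, macs_sparse)
```

In this sketch the sparse path yields the same output as the dense layer while performing one multiply-accumulate per surviving weight, so the amount of work scales with the fraction of weights that pruning leaves nonzero.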

Automated deep‐learning model optimization framework for microcontrollers

Explore Training of Deep Convolutional Neural Networks on Battery-powered Mobile Devices: Design and Application

MCMC: Multi-Constrained Model Compression Via One-Stage Envelope Reinforcement Learning.

Differentiable Network Pruning for Microcontrollers

MOC: Multi-Objective Mobile CPU-GPU Co-Optimization for Power-Efficient DNN Inference

Deep Compression for PyTorch Model Deployment on Microcontrollers

Efficient Neural Network Deployment for Microcontroller

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers

Neural networks on microcontrollers: saving memory at inference via operator reordering

Custom Hardware Inference Accelerator for TensorFlow Lite for Microcontrollers

ML-MCU: A Framework to Train ML Classifiers on MCU-based IoT Edge Devices

Quantization and Deployment of Deep Neural Networks on Microcontrollers

Low-Energy On-Device Personalization for MCUs

Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers

MCUNet: Tiny Deep Learning on IoT Devices

Energy-efficient Deployment of Deep Learning Applications on Cortex-M based Microcontrollers using Deep Compression

Efficient Neural Networks for Tiny Machine Learning: A Comprehensive Review

MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory

MicroNAS: Zero-Shot Neural Architecture Search for MCUs