Abstract: In recent years, the need for efficient deployment of Neural Networks (NN) on edge devices has been steadily increasing. However, the high computational demand of Machine Learning (ML) inference prevents a direct software deployment on tiny, resource-constrained, microcontroller-based IoT devices. Therefore, various custom and application-specific NN hardware accelerators have been proposed to enable real-time ML inference on low-power, resource-limited edge devices. Efficiently mapping the computational load onto hardware and software resources is a key challenge for improving performance while keeping power consumption and area footprint low. Hardware acceleration makes it possible to build embedded processors that are both high-performance and low-power. This paper presents an efficient hardware-software framework to accelerate ML inference on edge devices using a modified TensorFlow Lite for Microcontrollers (TFLM) model running on a Microcontroller (MCU) and a dedicated Neural Processing Unit (NPU) custom hardware accelerator, referred to as MCU-NPU. The proposed framework supports weight compression of pruned, quantized NN models and exploits the sparsity of the pruned model to further reduce computational complexity. The proposed methodology has been evaluated by applying the MCU-NPU acceleration to various TFLM-based NN architectures from the common MLPerf Tiny benchmark. Experimental results demonstrate a speedup of up to 724x compared to a pure software implementation. For example, the runtime for CIFAR-10 classification is reduced from about 20 s to only 37 ms using the proposed hardware acceleration. Moreover, the proposed hardware accelerator outperforms all the reference models optimized for edge devices in terms of inference runtime.
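
The abstract does not specify the compression format or how the NPU skips pruned weights, so the following is only a minimal sketch of the general idea, under stated assumptions: a pruned, quantized weight vector is stored as its non-zero values plus their indices, and the dot product then touches only those entries. The function names (`compress_weights`, `sparse_dot`, `dense_dot`) and the value/index layout are hypothetical illustrations, not the framework's actual interface or storage scheme.

```python
from typing import List, Tuple


def compress_weights(weights: List[int]) -> Tuple[List[int], List[int]]:
    """Keep only the non-zero (unpruned) quantized weights and their positions.

    Assumed layout: two parallel lists, values and indices. The paper's real
    compression format is not given in the abstract.
    """
    values: List[int] = []
    indices: List[int] = []
    for i, w in enumerate(weights):
        if w != 0:
            values.append(w)
            indices.append(i)
    return values, indices


def sparse_dot(values: List[int], indices: List[int], activations: List[int]) -> int:
    """Multiply-accumulate over non-zero weights only, skipping pruned entries."""
    return sum(v * activations[i] for v, i in zip(values, indices))


def dense_dot(weights: List[int], activations: List[int]) -> int:
    """Reference dense multiply-accumulate over every weight."""
    return sum(w * a for w, a in zip(weights, activations))


# Hypothetical pruned int8 weight row (5 of 8 weights pruned to zero).
weights = [0, 3, 0, 0, -2, 0, 0, 5]
activations = [7, 1, 4, 9, 2, 6, 8, 3]

values, indices = compress_weights(weights)
sparse_result = sparse_dot(values, indices, activations)
dense_result = dense_dot(weights, activations)
mac_count_sparse = len(values)   # multiply-accumulates actually performed
mac_count_dense = len(weights)   # multiply-accumulates in the dense version
```

In this toy case the sparse path performs 3 multiply-accumulate operations instead of 8 while producing the same result, which illustrates why exploiting pruned-model sparsity can reduce both storage and computation.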
