Abstract: Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, which couple instruction processors and hardware accelerators for tensor computations within the same micro-controller unit (MCU), is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. On the other hand, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators and therefore generate general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy, agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework enhanced with hardware cost models can compete with and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite, MATCH reduces inference latency by up to 60.88 times on DIANA compared to plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by 2.15 times compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergistically exploits the DNN accelerator and the eight-core cluster available on board.
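The abstract reports some gains as a factor ("2.15 times") and others as a percentage ("16.94%"). The short sketch below converts between the two so the figures can be compared on one scale. It assumes that "reduces latency by N times" means baseline latency divided by new latency equals N, and that "reduces latency by P%" means the new latency is (100 - P)% of the baseline; the function names are illustrative and do not come from the paper.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Speedup factor equivalent to a percentage latency reduction.

    Assumes new_latency = baseline * (1 - reduction_pct / 100).
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def reduction_from_speedup(speedup: float) -> float:
    """Percentage latency reduction equivalent to a speedup factor.

    Assumes speedup = baseline / new_latency.
    """
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    # Figures quoted in the abstract, converted to the other scale.
    print(f"16.94% reduction vs HTVM  = {speedup_from_reduction(16.94):.3f}x speedup")
    print(f"2.15x vs DORY on GAP9     = {reduction_from_speedup(2.15):.1f}% reduction")
    print(f"60.88x vs TVM on DIANA    = {reduction_from_speedup(60.88):.1f}% reduction")
```

Under these assumptions, the 16.94% reduction over HTVM corresponds to a speedup of roughly 1.2 times, while the 2.15 times improvement over DORY corresponds to a latency reduction of roughly 53.5%.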

What problem does this paper attempt to address?

This paper addresses the challenge of efficiently deploying deep neural networks (DNNs) on heterogeneous edge devices. The best current DNN compilation toolchains are usually deeply customized for a single microcontroller unit (MCU) family, and porting them to a different heterogeneous MCU family requires redeveloping almost the entire compiler. Retargetable toolchains such as TVM, on the other hand, support multiple hardware targets but cannot fully exploit custom accelerators, so the code they generate is general-purpose but unoptimized. To resolve this dilemma, the paper proposes MATCH, a TVM-based DNN deployment framework that aims for easy and agile retargeting across different MCU processors and accelerators through a customizable model-aware hardware abstraction.

The main contributions of MATCH are:

- It proposes MATCH, a new compiler that extends the TVM compilation process with a design space exploration (DSE) tool for DNN layer scheduling.
- It enhances the ZigZag tool so that it can read DNN layer workloads from TVM, provides an easy-to-modify API to support new hardware, and introduces a new code generation step.
- It benchmarks MATCH on two different heterogeneous MCUs. On multiple convolutional neural network (CNN) layers, MATCH reduces the average latency by 119.08 times and 83.18 times, respectively, compared to plain TVM.
- On the end-to-end DNNs of the MLPerf Tiny benchmark, MATCH matches or outperforms the best SoC-specific open-source toolchains, reducing the average latency by 2.15 times on GAP9 and by 16.94% on DIANA.

In conclusion, MATCH aims to provide a lightweight interface that lets compiler engineers easily support existing and future DNN operators and hardware targets while maintaining near-optimal performance.
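The idea of mapping layers with the help of hardware cost models can be illustrated with a toy example. The sketch below is not MATCH's API and does not reproduce the ZigZag cost model: the `Layer` and `ExecUnit` classes, the throughput and setup figures, and the `map_layers` function are all hypothetical. It only shows the general principle of describing each execution unit abstractly and then assigning every layer to the supported unit with the lowest estimated latency.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Layer:
    name: str
    op: str    # operator type, e.g. "conv2d", "dense", "softmax"
    macs: int  # multiply-accumulate operations in the layer


@dataclass(frozen=True)
class ExecUnit:
    """Hypothetical abstract model of one execution unit on the SoC."""
    name: str
    supported_ops: FrozenSet[str]
    macs_per_cycle: float  # assumed sustained throughput
    setup_cycles: int      # assumed fixed offload overhead per layer

    def estimate_cycles(self, layer: Layer) -> Optional[float]:
        """Return an estimated latency, or None if the op is unsupported."""
        if layer.op not in self.supported_ops:
            return None
        return self.setup_cycles + layer.macs / self.macs_per_cycle


def map_layers(layers: List[Layer], units: List[ExecUnit]) -> Dict[str, str]:
    """Assign each layer to the supported unit with the lowest estimate."""
    mapping: Dict[str, str] = {}
    for layer in layers:
        best_name, best_cost = None, float("inf")
        for unit in units:
            cost = unit.estimate_cycles(layer)
            if cost is not None and cost < best_cost:
                best_name, best_cost = unit.name, cost
        if best_name is None:
            raise ValueError(f"no execution unit supports op '{layer.op}'")
        mapping[layer.name] = best_name
    return mapping


if __name__ == "__main__":
    cpu = ExecUnit("cpu", frozenset({"conv2d", "dense", "softmax"}), 1.0, 0)
    accel = ExecUnit("accelerator", frozenset({"conv2d", "dense"}), 16.0, 1000)
    network = [
        Layer("conv1", "conv2d", 1_000_000),  # large: worth offloading
        Layer("fc_small", "dense", 500),      # tiny: setup cost dominates
        Layer("prob", "softmax", 100),        # unsupported on the accelerator
    ]
    for layer_name, unit_name in map_layers(network, [cpu, accel]).items():
        print(f"{layer_name} -> {unit_name}")
```

With these made-up numbers, the large convolution goes to the accelerator, while the tiny dense layer and the softmax stay on the CPU. Supporting a new target in this toy setting only means adding another `ExecUnit` description, which mirrors, in a very simplified form, the retargeting-by-hardware-model approach the summary describes.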

MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices
