Abstract: Optimal deployment of deep neural networks (DNNs) on state-of-the-art Systems-on-Chips (SoCs) is crucial for tiny machine learning (TinyML) at the edge. The complexity of these SoCs makes deployment non-trivial, as they typically contain multiple heterogeneous compute cores with limited, programmer-managed memory to optimize latency and energy efficiency. We propose HTVM, a compiler that merges TVM with DORY to maximize the utilization of heterogeneous accelerators and minimize data movements. HTVM allows deploying the MLPerf(TM) Tiny suite on DIANA, an SoC with a RISC-V CPU and digital and analog compute-in-memory AI accelerators, at 120x improved performance over plain TVM deployment.

### What problem does this paper attempt to solve?

This paper aims to solve the problem of efficiently deploying deep neural networks (DNNs) on heterogeneous TinyML platforms such as embedded systems. Specifically, it focuses on how to optimize the performance and energy efficiency of DNNs on resource-constrained edge devices, especially modern systems-on-chip (SoCs) that contain multiple heterogeneous computing cores and have limited, programmer-managed memory.

#### Main problems:

1. **Complexity**: Modern SoCs usually contain multiple heterogeneous computing cores, which makes the deployment of DNNs complex.
2. **Memory limitations**: The computing cores in these SoCs usually have limited memory, so deployment must be optimized to reduce data movement and improve latency and energy efficiency.
3. **Hardware-specific optimization**: Existing toolchains are either too general to fully utilize the hardware features of dedicated accelerators or too specific to adapt to different SoC architectures.

#### Solutions:

To solve these problems, the authors propose HTVM (Heterogeneous TinyML Virtual Machine), a compiler toolchain that combines the advantages of TVM (Tensor Virtual Machine) and DORY. The main contributions of HTVM include:

1. **Extending the TVM compilation process**: By introducing a memory-planning backend (based on DORY), HTVM can generate code and optimize data movement, maximizing the use of dedicated accelerator hardware.
2. **Hardware-aware tiling**: HTVM enables large layers to be executed efficiently on memory-constrained devices through hardware-aware tiling techniques.
3. **Multi-accelerator support**: HTVM can schedule multiple heterogeneous accelerators, reducing the number of kernel calls on the CPU and thus reducing the total latency.
4. **Performance verification**: HTVM has been extensively benchmarked on the DIANA platform, demonstrating significant performance improvements compared to other toolchains.

Through these improvements, HTVM can achieve efficient DNN deployment on TinyML platforms, significantly improving performance and reducing memory footprint.
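The idea behind hardware-aware tiling can be illustrated with a minimal sketch. It assumes a single convolution layer, 8-bit tensors, and an accelerator-local memory that must hold one input tile, the matching weights, and one output tile at the same time. The names `ConvLayer`, `tile_bytes`, and `choose_tile` are hypothetical, and the brute-force search is only a stand-in for illustration, not the solver that DORY or HTVM uses.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of one convolution layer (output-centric)."""
    in_ch: int
    out_ch: int
    out_h: int
    out_w: int
    kernel: int = 3
    stride: int = 1


def tile_bytes(layer, t_out_ch, t_h, t_w, bytes_per_elem=1):
    """Bytes of local memory needed to compute one output tile.

    Assumes all input channels are kept per tile and that the input tile
    includes the halo required by the kernel window.
    """
    in_h = (t_h - 1) * layer.stride + layer.kernel
    in_w = (t_w - 1) * layer.stride + layer.kernel
    inputs = layer.in_ch * in_h * in_w
    weights = t_out_ch * layer.in_ch * layer.kernel * layer.kernel
    outputs = t_out_ch * t_h * t_w
    return (inputs + weights + outputs) * bytes_per_elem


def divisors(n):
    """All positive divisors of n, so tiles divide the layer evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def choose_tile(layer, budget_bytes):
    """Return the largest (out_ch, h, w) output tile that fits the budget.

    Returns None when even the smallest tile does not fit.
    """
    best = None
    candidates = product(
        divisors(layer.out_ch), divisors(layer.out_h), divisors(layer.out_w)
    )
    for c, h, w in candidates:
        if tile_bytes(layer, c, h, w) <= budget_bytes:
            size = c * h * w
            if best is None or size > best[0]:
                best = (size, (c, h, w))
    return None if best is None else best[1]
```

Calling `choose_tile(ConvLayer(16, 32, 32, 32), 64 * 1024)` returns the full layer shape because the whole layer fits in the assumed 64 KiB budget, while a smaller budget forces a smaller tile. Larger tiles mean fewer transfers between main and local memory, which is the data-movement reduction the summary describes. A real flow would also account for double buffering and accelerator-specific alignment constraints.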

HTVM: Efficient Neural Network Deployment On Heterogeneous TinyML Platforms
