Abstract: Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic, exploiting a distributed system of MCUs, partitioning inference across multiple devices, and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, a super-linear speedup of 26.1x, and an Energy Delay Product (EDP) improvement of 27.2x, compared to a single-chip system. On MobileBERT, the distributed system's runtime is 38.8 ms, with a super-linear 4.7x speedup when using 4 MCUs compared to a single-chip system.
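The relationship between the reported speedup and EDP figures can be checked with a short back-of-the-envelope calculation. The sketch below assumes that EDP is defined as energy multiplied by latency and that the 26.1x speedup applies to the same latency measurement as the 0.54 ms figure. The single-chip values it derives are implied estimates under those assumptions, not numbers reported in the abstract.

```python
# Figures quoted in the abstract for the 8-MCU TinyLlama-42M deployment.
energy_mj = 0.64     # energy per inference, in mJ
latency_ms = 0.54    # latency per inference, in ms
speedup = 26.1       # latency improvement over a single chip
edp_gain = 27.2      # EDP improvement over a single chip

# Assumption: EDP = energy * latency.
edp_multi = energy_mj * latency_ms  # in mJ*ms

# Under that assumption, EDP gain = energy ratio * speedup,
# so the implied single-chip/multi-chip energy ratio is:
energy_ratio = edp_gain / speedup

# Implied (not reported) single-chip figures.
single_latency_ms = latency_ms * speedup
single_energy_mj = energy_mj * energy_ratio
single_edp = single_energy_mj * single_latency_ms

print(f"multi-chip EDP: {edp_multi:.4f} mJ*ms")
print(f"implied energy ratio: {energy_ratio:.3f}")
print(f"implied single-chip latency: {single_latency_ms:.3f} ms")
print(f"implied single-chip energy: {single_energy_mj:.3f} mJ")
```

Under these assumptions the implied single-chip latency is about 14.1 ms and the implied single-chip energy about 0.67 mJ. An energy ratio slightly above 1 means the 8-chip system would use marginally less energy per inference than one chip, despite running on more hardware.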

What problem does this paper attempt to address?

The paper addresses the efficient deployment of Transformer models on resource-constrained, low-power microcontroller units (MCUs), especially in wearable devices such as smart glasses. The main challenges are:

1. **Limited on-chip memory**: The on-chip memory of MCUs is very limited, usually not exceeding 8 MiB, so large Transformer models cannot run directly on these devices.
2. **High computational requirements**: Transformer models have high computational and memory requirements, which makes deployment on edge devices difficult.
3. **Latency and energy consumption**: Relying on external memory or cloud computing leads to higher latency, increased energy consumption, and privacy issues.

To solve these problems, the paper proposes a distributed inference scheme that assigns inference tasks to multiple MCUs and minimizes off-chip communication traffic. The key points of this scheme are:

- **Weight distribution**: The Transformer weights are scattered across multiple MCUs without duplication, which reduces the on-chip memory needed on each chip.
- **Synchronization optimization**: Each Transformer block requires only two synchronization operations, which reduces inter-chip communication overhead.
- **On-chip execution**: As many inference operations as possible are completed on-chip, reducing the dependence on external memory and thereby reducing latency and energy consumption.

With these methods, the paper deploys the TinyLlama and MobileBERT models on multi-chip systems and reports significant performance and energy-efficiency improvements. For example, with 8 MCUs, TinyLlama inference in autoregressive mode is 26.1 times faster, while the energy consumption remains the same or even decreases.
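The weight-distribution and synchronization ideas above can be illustrated with a small sketch. The summary does not spell out the paper's exact partitioning, so the code below uses a common tensor-parallel split as a stand-in: the first feed-forward matrix is split by columns, the second by rows, and each chip's partial output is summed once. All names (`split_columns`, `split_rows`, `mlp_distributed`), the ReLU activation, and the toy dimensions are hypothetical, and only the feed-forward half of a block is shown.

```python
import random

def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def split_columns(W, n_chips):
    """Give each chip a contiguous slice of W's columns (no duplication)."""
    step = len(W[0]) // n_chips
    return [[row[c * step:(c + 1) * step] for row in W] for c in range(n_chips)]

def split_rows(W, n_chips):
    """Give each chip a contiguous slice of W's rows (no duplication)."""
    step = len(W) // n_chips
    return [W[c * step:(c + 1) * step] for c in range(n_chips)]

def mlp_reference(x, W1, W2):
    """Single-chip feed-forward layer: relu(x @ W1) @ W2."""
    return matvec(relu(matvec(x, W1)), W2)

def mlp_distributed(x, W1_shards, W2_shards):
    """Each chip uses only its own shards; one sum combines the results."""
    partials = [matvec(relu(matvec(x, w1)), w2)
                for w1, w2 in zip(W1_shards, W2_shards)]
    # The only synchronization point: element-wise sum of partial outputs.
    return [sum(vals) for vals in zip(*partials)]

# Toy sizes (assumptions): model width 8, hidden width 16, 4 chips.
random.seed(0)
D, H, N_CHIPS = 8, 16, 4
W1 = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(D)]
W2 = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(H)]
x = [random.uniform(-1, 1) for _ in range(D)]

W1_shards = split_columns(W1, N_CHIPS)
W2_shards = split_rows(W2, N_CHIPS)
y_ref = mlp_reference(x, W1, W2)
y_dist = mlp_distributed(x, W1_shards, W2_shards)
max_err = max(abs(a - b) for a, b in zip(y_ref, y_dist))
stored = sum(len(w1) * len(w1[0]) + len(w2) * len(w2[0])
             for w1, w2 in zip(W1_shards, W2_shards))
print(f"max abs difference: {max_err:.2e}, weights stored: {stored}")
```

Because the activation is element-wise, each chip can apply it to its own slice of the hidden vector without exchanging data, so the partial outputs only need to be combined once per layer. The total number of stored weights equals that of the unsplit layer, which is the sense in which no weight is copied to more than one chip.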

Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs

Optimizing the Deployment of Tiny Transformers on Low-Power MCUs

MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory

TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices

Efficient Deployment of Transformer Models in Analog In-Memory Computing Hardware

MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

ED-ViT: Splitting Vision Transformer for Distributed Inference on Edge Devices

Low-Energy On-Device Personalization for MCUs

Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference

Adaptive Offloading of Transformer Inference for Weak Edge Devices with Masked Autoencoders

Quantization and Deployment of Deep Neural Networks on Microcontrollers

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

MCUNet: Tiny Deep Learning on IoT Devices

Robustifying the Deployment of tinyML Models for Autonomous Mini-Vehicles

Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

An Ultra-low Power TinyML System for Real-time Visual Processing at Edge

Exploring Approximation and Dataflow Co-Optimization for Scalable Transformer Inference Architecture on the Edge

Deep Compression for PyTorch Model Deployment on Microcontrollers

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots