Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs

Severin Bochem, Victor J.B. Jung, Arpan Prasad, Francesco Conti, Luca Benini
2024-12-06
Abstract: Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, a super-linear speedup of 26.1x, and an Energy Delay Product (EDP) improvement of 27.2x, compared to a single-chip system. On MobileBERT, the distributed system's runtime is 38.8 ms, with a super-linear 4.7x speedup when using 4 MCUs compared to a single-chip system.
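The speedup and EDP figures in the abstract can be related with a short calculation. The sketch below assumes the usual definition EDP = energy x delay and assumes that both quoted ratios refer to the same TinyLlama-42M configuration; the variable names are illustrative and do not come from the paper.

```python
# Ratios quoted in the abstract (8 MCUs vs. a single chip, TinyLlama-42M).
speedup = 26.1          # T_single / T_distributed
edp_improvement = 27.2  # (E_single * T_single) / (E_distributed * T_distributed)

# Assumption: EDP = energy * delay, so the EDP ratio factors into
# (energy ratio) * (speedup). Dividing isolates the energy ratio.
energy_ratio = edp_improvement / speedup  # E_single / E_distributed

print(f"implied single-chip / distributed energy ratio: {energy_ratio:.3f}")
```

Under these assumptions the ratio comes out at about 1.04, meaning the 8-chip system would use roughly the same total energy per inference as the single chip, or slightly less, despite running much faster.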
Hardware Architecture
What problem does this paper attempt to address?
This paper addresses the efficient deployment of Transformer models on resource-constrained, low-power microcontroller units (MCUs), especially in wearable devices such as smart glasses. The main challenges are:

1. **Limited on-chip memory**: The on-chip memory of MCUs is very limited, usually not exceeding 8 MiB, so large Transformer models cannot run directly on these devices.
2. **High computational requirements**: Transformer models have high compute and memory demands, which makes deployment on edge devices difficult.
3. **Latency and energy consumption**: Relying on external memory or cloud computing leads to higher latency, increased energy consumption, and privacy issues.

To solve these problems, the paper proposes a distributed inference scheme that partitions inference across multiple MCUs while minimizing off-chip communication traffic. The key points of the scheme are:

- **Weight distribution**: The Transformer weights are split across multiple MCUs without duplication, which reduces the on-chip memory needed on each chip.
- **Synchronization optimization**: Each Transformer block requires only two synchronization operations, which reduces the inter-chip communication overhead.
- **On-chip execution**: As many operations as possible are completed on-chip during inference, reducing the dependence on external memory and thereby reducing latency and energy consumption.

The paper demonstrates the scheme by deploying the TinyLlama and MobileBERT models on multi-chip systems, achieving significant performance and energy-efficiency improvements. For example, with 8 MCUs, TinyLlama inference in autoregressive mode runs 26.1 times faster than on a single chip, while the energy consumption remains the same or even decreases.
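The idea of distributing weights without duplication and synchronizing only once per sub-layer can be illustrated with a small simulation. This is a minimal sketch of a generic tensor-parallel split of a two-layer feed-forward block, not the authors' implementation: the ReLU activation, the helper names (`split_columns`, `split_rows`, `all_reduce_sum`, `ffn_distributed`), and the use of plain Python lists in place of MCU memory are all assumptions made for illustration.

```python
# Illustrative sketch only: simulate N "chips" that each hold a disjoint
# slice of the feed-forward weights. All names here are hypothetical.

def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m), returns length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def split_columns(W, n_chips):
    """Give each chip a contiguous, non-overlapping slice of W's columns."""
    assert len(W[0]) % n_chips == 0, "columns must divide evenly"
    step = len(W[0]) // n_chips
    return [[row[c * step:(c + 1) * step] for row in W] for c in range(n_chips)]

def split_rows(W, n_chips):
    """Give each chip a contiguous, non-overlapping slice of W's rows."""
    assert len(W) % n_chips == 0, "rows must divide evenly"
    step = len(W) // n_chips
    return [W[c * step:(c + 1) * step] for c in range(n_chips)]

def all_reduce_sum(partials):
    """Stand-in for the inter-chip synchronization: elementwise sum."""
    return [sum(vals) for vals in zip(*partials)]

def ffn_single(x, W1, W2):
    """Reference: the whole block on one chip."""
    return matvec(relu(matvec(x, W1)), W2)

def ffn_distributed(x, W1, W2, n_chips):
    """Each chip computes with its own weight slices, then one reduction."""
    partials = []
    for W1_c, W2_c in zip(split_columns(W1, n_chips), split_rows(W2, n_chips)):
        h_c = relu(matvec(x, W1_c))         # local, no communication
        partials.append(matvec(h_c, W2_c))  # local partial output
    return all_reduce_sum(partials)         # the only synchronization point

x = [1.0, -2.0]
W1 = [[1, 2, -1, 0.5], [0.5, -1, 2, 1]]   # 2 x 4
W2 = [[1, 0], [2, 1], [0, 3], [1, 1]]     # 4 x 2

reference = ffn_single(x, W1, W2)
for n in (2, 4):
    assert ffn_distributed(x, W1, W2, n) == reference
print(reference)
```

Because the first matrix is split by columns and the second by rows, the elementwise activation can be applied locally and the chips only need to exchange data once, when the partial outputs are summed. Applying the same pattern to the attention sub-layer would give two synchronizations per Transformer block, which matches the count stated above, though the paper's exact partitioning may differ from this sketch.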