Decoupled Access-Execute enabled DVFS for tinyML deployments on STM32 microcontrollers

Elisavet Lydia Alvanaki, Manolis Katsaragakis, Dimosthenis Masouros, Sotirios Xydis, Dimitrios Soudris
2024-07-04
Abstract: Over the last years, the number of Machine Learning (ML) inference applications deployed on the Edge has been growing rapidly. Recent Internet of Things (IoT) devices and microcontrollers (MCUs) are becoming more and more mainstream in everyday activities. In this work, we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clocking exploration and configuration and decoupled access-execute convolution kernel execution. Our approach is enhanced with optimization of the power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, composing an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with power-saving modes of the STM32 MCUs, indicating that we can achieve up to 25.2% less energy consumption for varying QoS levels.
Hardware Architecture
What problem does this paper attempt to address?
This paper addresses two major challenges in deploying Convolutional Neural Networks (CNNs) on resource-constrained microcontrollers (MCUs):

1. **Limited Memory Capacity**: The limited memory capacity of MCUs makes it difficult to store and execute complex CNN models. This issue becomes more pronounced as emerging CNN architectures become deeper.
2. **Energy Efficiency**: Since MCUs are often used in battery-powered edge devices, it is crucial to use energy efficiently when performing computationally intensive deep learning tasks.

To tackle these problems, the paper proposes a novel method based on Dynamic Voltage and Frequency Scaling (DVFS) that reduces energy consumption by optimizing clock configurations and employing Decoupled Access-Execute (DAE) techniques. Specifically, the method includes the following aspects:

- **Selecting the Optimal Clock Configuration**: Choosing the energy-optimal clock scheme among different configurations with equivalent delay.
- **Applying the Decoupled Access-Execute (DAE) Technique**: Dividing the kernel execution of convolutional layers into memory-bound and compute-bound regions to better utilize the processor's idle time during memory access.
- **Extracting the Optimal DVFS Strategy**: Formulating the optimal DVFS allocation decisions based on the target CNN architecture and the characteristics of the MCU.

Through these methods, the paper achieves significant energy optimization while meeting different Quality of Service (QoS) requirements. Experimental results show that, compared to existing state-of-the-art methods, this approach can achieve up to 25.2% energy reduction.
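
The DVFS selection step summarized above can be illustrated with a small sketch: pick one clock configuration per layer so that total energy is minimal while total latency stays within a QoS budget. This is not the paper's solver. The layer profiles, the names `LAYERS` and `best_assignment`, and all numbers are hypothetical, and the cost of switching frequency between layers is ignored. The search is exhaustive, which is only practical for a handful of layers; this is consistent with the abstract's description of the problem as NP-complete.

```python
from itertools import product

# Hypothetical per-layer profiles (made-up numbers, not measurements).
# Each option is (clock_label, latency_us, energy_uJ); integers avoid
# floating-point rounding when summing.
LAYERS = [
    [("low", 4000, 1000), ("high", 2000, 1600)],
    [("low", 6000, 1500), ("high", 3000, 2400)],
    [("low", 2000, 500), ("high", 1000, 900)],
]


def best_assignment(layers, latency_budget_us):
    """Return (energy_uJ, latency_us, labels) of the lowest-energy
    per-layer clock assignment that meets the latency budget,
    or None if no assignment is feasible.

    Exhaustive search over every combination of per-layer options;
    frequency-switching overhead is assumed to be zero.
    """
    best = None
    for choice in product(*layers):
        latency = sum(option[1] for option in choice)
        energy = sum(option[2] for option in choice)
        if latency > latency_budget_us:
            continue  # violates the QoS latency constraint
        if best is None or energy < best[0]:
            best = (energy, latency, [option[0] for option in choice])
    return best


if __name__ == "__main__":
    for budget in (12000, 9000, 5000):
        print(budget, best_assignment(LAYERS, budget))
```

With a loose budget every layer can run at the low clock. As the budget tightens, the search raises the clock only where the latency saved is worth the extra energy, and it reports infeasibility when even the fastest configuration misses the deadline.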