Decoupled Access-Execute enabled DVFS for tinyML deployments on STM32 microcontrollers

Elisavet Lydia Alvanaki,Manolis Katsaragakis,Dimosthenis Masouros,Sotirios Xydis,Dimitrios Soudris

2024-07-04

Abstract: Over the last years, the number of Machine Learning (ML) inference applications deployed on the Edge has been growing rapidly. Recent Internet of Things (IoT) devices and microcontrollers (MCUs) are becoming more and more mainstream in everyday activities. In this work we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clocking exploration and configuration and decoupled access-execute convolution kernel execution. Our approach is enhanced with optimization of the power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, composing an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with power-saving modes of the STM32 MCUs, showing that we can achieve up to 25.2% less energy consumption for varying QoS levels.
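The latency-constrained DVFS problem mentioned in the abstract can be illustrated with a minimal sketch. It assumes each layer has a small set of candidate operating points, each with a known latency and energy, and picks one point per layer so that total energy is minimal while total latency stays within a budget. The function name `best_assignment`, the table `LAYER_OPTIONS`, and every number in it are hypothetical placeholders, not values or code from the paper.

```python
from itertools import product

# Hypothetical per-layer operating points: (frequency_mhz, latency_ms, energy_mj).
# Illustrative placeholders only, not measurements from the paper.
LAYER_OPTIONS = [
    [(216, 2.0, 1.00), (108, 4.0, 0.80), (54, 8.0, 0.70)],
    [(216, 3.0, 1.50), (108, 6.0, 1.10), (54, 12.0, 0.95)],
    [(216, 1.0, 0.50), (108, 2.0, 0.42), (54, 4.0, 0.38)],
]


def best_assignment(layer_options, latency_budget_ms):
    """Pick one operating point per layer, minimizing total energy
    subject to total latency <= latency_budget_ms.

    Returns (energy_mj, latency_ms, [frequency_mhz per layer]),
    or None when no assignment meets the budget.
    """
    best = None
    # Exhaustive search over every combination of per-layer choices.
    for choice in product(*layer_options):
        latency = sum(point[1] for point in choice)
        energy = sum(point[2] for point in choice)
        if latency > latency_budget_ms:
            continue  # violates the latency (QoS) constraint
        if best is None or energy < best[0]:
            best = (energy, latency, [point[0] for point in choice])
    return best


if __name__ == "__main__":
    for budget in (6.0, 12.0):
        print(budget, best_assignment(LAYER_OPTIONS, budget))
```

The exhaustive search grows exponentially with the number of layers, which is consistent with the abstract's description of the problem as NP-complete. A realistic network would likely need a dynamic-programming or integer-programming formulation instead.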

Hardware Architecture

What problem does this paper attempt to address?

This paper addresses two major challenges in deploying Convolutional Neural Networks (CNNs) on resource-constrained microcontrollers (MCUs):

1. **Limited Memory Capacity**: The limited memory capacity of MCUs makes it difficult to store and execute complex CNN models. This issue becomes more pronounced as emerging CNN architectures become deeper.
2. **Energy Efficiency**: Since MCUs are often used in battery-powered edge devices, energy must be used efficiently when performing computationally intensive deep learning tasks.

To tackle these problems, the paper proposes a novel method based on Dynamic Voltage and Frequency Scaling (DVFS) that reduces energy consumption by optimizing clock configurations and employing Decoupled Access-Execute (DAE) techniques. Specifically, the method includes the following aspects:

- **Selecting the Optimal Clock Configuration**: Choosing the energy-optimal clock scheme among clock configurations that yield equivalent latency.
- **Applying the Decoupled Access-Execute (DAE) Technique**: Dividing the kernel execution of convolutional layers into memory-bound and compute-bound regions to better utilize the processor's idle time during memory access.
- **Extracting the Optimal DVFS Strategy**: Formulating the optimal DVFS allocation decisions based on the target CNN architecture and the characteristics of the MCU.

Through these methods, the paper achieves significant energy savings while meeting different Quality of Service (QoS) requirements. Experimental results show that, compared to existing state-of-the-art methods, this approach achieves up to 25.2% energy reduction.
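The idea of choosing among equivalent clock configurations can be sketched as follows. The example enumerates main-PLL divider settings that all produce the same system clock. It assumes an STM32F4/F7-style PLL model (`SYSCLK = HSE / M * N / P`) with limits that should be checked against the reference manual of the actual part. The function names and the "lowest VCO frequency" selection rule are hypothetical stand-ins; a real selection would rely on measured power, as the paper's method presumably does.

```python
def equivalent_pll_configs(hse_mhz, target_sysclk_mhz):
    """Enumerate (M, N, P) settings yielding the target SYSCLK.

    Assumed PLL model and limits (verify against the reference manual):
      vco_in  = hse / M      with 1 MHz <= vco_in <= 2 MHz
      vco_out = vco_in * N   with 100 MHz <= vco_out <= 432 MHz
      sysclk  = vco_out / P  with P in {2, 4, 6, 8}
    Integer arithmetic is used so equality checks are exact.
    """
    configs = []
    for m in range(2, 64):
        if not (m <= hse_mhz <= 2 * m):  # VCO input range
            continue
        for n in range(50, 433):
            if not (100 * m <= hse_mhz * n <= 432 * m):  # VCO output range
                continue
            for p in (2, 4, 6, 8):
                if hse_mhz * n == target_sysclk_mhz * m * p:
                    configs.append(
                        {"M": m, "N": n, "P": p, "vco_mhz": hse_mhz * n / m}
                    )
    return configs


def pick_lowest_vco(configs):
    """Hypothetical proxy: prefer the lowest VCO frequency.

    This stands in for a measured-power lookup and is an assumption,
    not the selection rule used in the paper.
    """
    return min(configs, key=lambda c: c["vco_mhz"])


if __name__ == "__main__":
    candidates = equivalent_pll_configs(hse_mhz=8, target_sysclk_mhz=108)
    print(len(candidates), "equivalent configurations")
    print("selected:", pick_lowest_vco(candidates))
```

With an 8 MHz external oscillator and a 108 MHz target, several divider combinations reach the same system clock at different VCO frequencies, which is the kind of search space the clock-configuration step explores.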
