Abstract: Convolutional neural networks (CNNs) are widely applied to medical tasks because they can achieve high accuracy, at the cost of a large number of parameters and operations. However, many applications designed for auxiliary checks or assistance must be deployed on portable devices, where the parameter and operation counts of a standard CNN become an obstacle. MobileNet replaces the standard convolution with a depthwise separable convolution, which greatly reduces the number of operations and parameters while maintaining relatively high accuracy. Such highly structured models are well suited to FPGA implementation, which can further reduce resource requirements and improve efficiency. Many existing implementations focus on performance rather than resource requirements, because MobileNets already reduce both parameters and operations and achieve significant results. However, many small devices have such limited resources that they cannot run MobileNet-like efficient networks in the usual way, and many auxiliary medical applications still require a high-performance network running in real time. Hence, a dedicated accelerator structure is needed to further reduce memory and other resource requirements when running MobileNet-like efficient networks. In this paper, we propose a MobileNet accelerator that minimizes the on-chip memory capacity and the amount of data transferred between on-chip and off-chip memory. We propose two configurable computing modules, the Pointwise Convolution Accelerator and the Depthwise Convolution Accelerator, to parallelize the network and reduce the memory requirement with a specific dataflow model. A new cache usage method is also proposed to further reduce on-chip memory use.
We implemented the accelerator on a Xilinx XC7Z020, deployed MobileNetV2 on it, and achieved 70.94 FPS with 524.25 KB of on-chip memory at a clock frequency of 150 MHz.
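
As a rough illustration of why the depthwise separable convolution mentioned in the abstract reduces parameters and operations, the sketch below counts weights and multiply-accumulate operations (MACs) for a single layer in both forms. The function names and the layer shape used in the example (3x3 kernel, 32 input channels, 64 output channels, 112x112 output) are assumptions chosen for illustration and are not taken from the paper; biases are ignored.

```python
def standard_conv_cost(k, c_in, c_out, h_out, w_out):
    """Return (parameters, MACs) for a k x k standard convolution, bias ignored."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out  # every weight is used once per output position
    return params, macs


def depthwise_separable_cost(k, c_in, c_out, h_out, w_out):
    """Return (parameters, MACs) for a k x k depthwise plus 1 x 1 pointwise convolution."""
    depthwise_params = k * k * c_in   # one k x k filter per input channel
    pointwise_params = c_in * c_out   # 1 x 1 convolution that mixes channels
    params = depthwise_params + pointwise_params
    macs = params * h_out * w_out
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer shape, for illustration only.
    shape = dict(k=3, c_in=32, c_out=64, h_out=112, w_out=112)
    std_params, std_macs = standard_conv_cost(**shape)
    sep_params, sep_macs = depthwise_separable_cost(**shape)
    print(f"standard:  {std_params} params, {std_macs} MACs")
    print(f"separable: {sep_params} params, {sep_macs} MACs")
    print(f"ratio:     {sep_params / std_params:.4f}")
```

For any layer shape the ratio equals 1/c_out + 1/k^2, so with a 3x3 kernel the separable form needs roughly one eighth to one ninth of the weights and MACs of the standard form. This count says nothing about the on-chip memory figures reported above, which depend on the accelerator's dataflow.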

MicroENet: an Efficient Network for MCUs with Low Model Parameters and Peak Memory

MCUNet: Tiny Deep Learning on IoT Devices

DaDianNao: A Machine-Learning Supercomputer

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks

SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers

MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

On-Device Training Under 256KB Memory

Efficient Neural Networks for Tiny Machine Learning: A Comprehensive Review

$μ$NAS: Constrained Neural Architecture Search for Microcontrollers

Efficient Neural Network Deployment for Microcontroller

Design of a Novel Neural Network Compression Method for Tiny Machine Learning

Low-Energy On-Device Personalization for MCUs

Enabling Large Neural Networks on Tiny Microcontrollers with Swapping

Neural networks on microcontrollers: saving memory at inference via operator reordering

MicronNet: A Highly Compact Deep Convolutional Neural Network Architecture for Real-time Embedded Traffic Sign Classification

Memory-Driven Mixed Low Precision Quantization For Enabling Deep Network Inference On Microcontrollers

Differentiable Network Pruning for Microcontrollers

A Low Memory Requirement MobileNets Accelerator Based on FPGA for Auxiliary Medical Tasks