Abstract: Convolutional neural networks (CNNs) have been widely applied to medical tasks because they achieve high accuracy in many fields, albeit by using a large number of parameters and operations. However, many applications designed for auxiliary checks or assistance need to be deployed on portable devices, where the huge number of operations and parameters of a standard CNN becomes an obstacle. MobileNet replaces the standard convolution with a depthwise separable convolution, which greatly reduces the number of operations and parameters while maintaining relatively high accuracy. Such highly structured models are well suited to FPGA implementation, which can further reduce resource requirements and improve efficiency. Many existing implementations focus on performance more than on resource requirements, because MobileNets already reduce both parameters and operations and achieve significant results. However, many small devices have such limited resources that they cannot run MobileNet-like efficient networks in the conventional way, and many auxiliary medical applications still require a high-performance network running in real time. Hence, a specific accelerator structure is needed to further reduce memory and other resource requirements when running MobileNet-like efficient networks. In this paper, a MobileNet accelerator is proposed that minimizes both the on-chip memory capacity and the amount of data transferred between on-chip and off-chip memory. We propose two configurable computing modules, the Pointwise Convolution Accelerator and the Depthwise Convolution Accelerator, to parallelize the network and reduce the memory requirement with a specific dataflow model. A new cache usage method is also proposed to further reduce on-chip memory use. We implemented the accelerator on a Xilinx XC7Z020, deployed MobileNetV2 on it, and achieved 70.94 FPS with 524.25 KB of on-chip memory at 150 MHz.
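To make the saving from depthwise separable convolution concrete, the following minimal Python sketch compares the parameter and multiply-accumulate (MAC) counts of a standard convolution with those of a depthwise convolution followed by a pointwise (1x1) convolution. The layer shape (3x3 kernel, 32 input channels, 64 output channels, 112x112 output) is a hypothetical example rather than a value taken from the paper, the function names are illustrative, and biases are ignored.

```python
def standard_conv_cost(k, c_in, c_out, h_out, w_out):
    """Parameters and MACs of a standard k x k convolution (biases ignored)."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out  # every weight is used once per output position
    return params, macs


def depthwise_separable_cost(k, c_in, c_out, h_out, w_out):
    """Parameters and MACs of a k x k depthwise plus 1 x 1 pointwise convolution."""
    dw_params = k * k * c_in   # one k x k filter per input channel
    pw_params = c_in * c_out   # 1 x 1 convolution that mixes channels
    params = dw_params + pw_params
    macs = params * h_out * w_out
    return params, macs


# Hypothetical layer shape, chosen only for illustration.
K, C_IN, C_OUT, H, W = 3, 32, 64, 112, 112

std_params, std_macs = standard_conv_cost(K, C_IN, C_OUT, H, W)
dws_params, dws_macs = depthwise_separable_cost(K, C_IN, C_OUT, H, W)

# The cost ratio equals 1/C_OUT + 1/K^2 for both parameters and MACs.
ratio = dws_params / std_params

print(f"standard:  {std_params} params, {std_macs} MACs")
print(f"separable: {dws_params} params, {dws_macs} MACs")
print(f"ratio:     {ratio:.4f}")
```

For this shape the separable form needs roughly one eighth of the parameters and MACs of the standard convolution. The ratio is dominated by the 1/K^2 term once the number of output channels is large.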

A Memory-Efficient CNN Accelerator Using Segmented Logarithmic Quantization and Multi-Cluster Architecture

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability.

DaDianNao: A Machine-Learning Supercomputer

Memory-Efficient CNN Accelerator Based on Interlayer Feature Map Compression

ASLog: an Area-Efficient CNN Accelerator for Per-Channel Logarithmic Post-Training Quantization

Memory-Efficient Compression Based on Least-Squares Fitting in Convolutional Neural Network Accelerators.

A hardware-friendly logarithmic quantization method for CNNs and FPGA implementation

A Communication-Aware DNN Accelerator on ImageNet Using In-Memory Entry-Counting Based Algorithm-Circuit-Architecture Co-Design in 65-nm CMOS

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

An Efficient CNN Inference Accelerator Based on Intra- and Inter-Channel Feature Map Compression

Memory-centric accelerator design for Convolutional Neural Networks

Power Efficient Tiny Yolo CNN Using Reduced Hardware Resources Based on Booth Multiplier and WALLACE Tree Adders

High-Performance FPGA-Based CNN Accelerator with Block-Floating-Point Arithmetic.

A Low Memory Requirement MobileNets Accelerator Based on FPGA for Auxiliary Medical Tasks

A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device

A Small-Footprint Accelerator for Large-Scale Neural Networks

An Energy-Efficient Quantized and Regularized Training Framework for Processing-In-Memory Accelerators

ARBiS: A Hardware-Efficient SRAM CIM CNN Accelerator with Cyclic-Shift Weight Duplication and Parasitic-Capacitance Charge Sharing for AI Edge Application

Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights Generation

An Energy-Efficient Mixed-Bit CNN Accelerator With Column Parallel Readout for ReRAM-Based In-Memory Computing

Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration