Abstract: Fine-grained sparse convolutional neural networks (CNNs) achieve a better trade-off between model accuracy and size than coarse-grained sparse CNNs. However, because of their irregular data structures and unbalanced computation loads, fine-grained sparse CNNs struggle to fully exploit their computation and storage advantages on general-purpose edge hardware. Existing custom sparse accelerators mainly emulate a balanced load through software or computational strategies, and they neglect how well the computing architecture itself adapts to, and parallelizes, fine-grained sparse models. To address these challenges, a cross-mesh NoC-based accelerator architecture is proposed. This architecture aligns with the irregular characteristics of fine-grained sparse CNN weights and enhances the spatio-temporal parallelism of fine-grained sparse CNNs. First, a sparse multiplier unit (SMU) array and an adder array are designed to enable parallel execution of the multiplication and accumulation operations in convolution. Then, element-wise unroll-based nonzero weight multiplication is mapped to the SMU array to provide more flexible spatial parallelism. A horizontal and vertical cross-mesh NoC is proposed for flexible dataflow scheduling between the SMU and adder arrays, which further improves temporal parallelism. This architecture allows the multiplication and accumulation operations in convolution to be decoupled and pipelined with negligible latency. Finally, the proposed accelerator architecture is implemented on the ZU9EG platform. The experimental results show that the proposed accelerator achieves 509.9, 249.3, 100.7, 48.4, and 168.9 frames per second (FPS) for AlexNet, VGG-16, ResNet-18, MobileNet-v2, and EfficientNet, respectively. Compared with related works, this accelerator improves inference speed by $1.1\times$ to $36.1\times$ and energy efficiency by $2.4\times$ to $13.4\times$.
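The decoupling of multiplication from accumulation described above can be illustrated with a small software sketch. This is not the accelerator's implementation: the function names (`nonzero_weights`, `multiply_phase`, `accumulate_phase`) are hypothetical, the kernel is a single 2D channel with stride 1 and no padding, and the NoC, pipelining, and hardware parallelism are not modeled. It only shows that skipping zero weights and unrolling the multiplication per nonzero weight, followed by a separate accumulation step, gives the same result as a dense convolution.

```python
def nonzero_weights(kernel):
    """Compress a 2D kernel into (row, col, value) triples, skipping zeros."""
    return [(r, c, w)
            for r, row in enumerate(kernel)
            for c, w in enumerate(row)
            if w != 0]


def multiply_phase(image, nz, out_h, out_w):
    """Produce one stream of partial products per nonzero weight.

    Assumption: each stream stands in for the work of one multiplier unit.
    Each entry is (out_row, out_col, product).
    """
    streams = []
    for r, c, w in nz:
        streams.append([(y, x, w * image[y + r][x + c])
                        for y in range(out_h)
                        for x in range(out_w)])
    return streams


def accumulate_phase(streams, out_h, out_w):
    """Sum partial products by output position (stand-in for the adder array)."""
    out = [[0] * out_w for _ in range(out_h)]
    for stream in streams:
        for y, x, product in stream:
            out[y][x] += product
    return out


def sparse_conv2d(image, kernel):
    """Valid (no padding), stride-1 convolution using only nonzero weights."""
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    nz = nonzero_weights(kernel)
    return accumulate_phase(multiply_phase(image, nz, out_h, out_w), out_h, out_w)


def dense_conv2d(image, kernel):
    """Reference dense convolution that also multiplies by zero weights."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[r][c] * image[y + r][x + c]
                 for r in range(kh) for c in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]
```

In this sketch a 3×3 kernel with three nonzero weights needs three multiplications per output pixel instead of nine. The number of streams varies with the sparsity pattern of each kernel, which is the irregularity the abstract attributes to fine-grained sparse weights.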

Eyelet: A Cross-Mesh NoC-Based Fine-Grained Sparse CNN Accelerator for Spatio-Temporal Parallel Computing Optimization
