Abstract: Fine-grained sparse convolutional neural networks (CNNs) achieve a better trade-off between model accuracy and size than coarse-grained sparse CNNs. However, because of their irregular data structures and unbalanced computation loads, fine-grained sparse CNNs struggle to realize their computation and storage advantages on general-purpose edge hardware. Existing custom sparse accelerators focus on emulating a balanced load through software or computational strategies and neglect the adaptability and parallelism of the computing architecture itself for fine-grained sparse models. To address these challenges, a cross-mesh NoC-based accelerator architecture is proposed. This architecture matches the irregular structure of fine-grained sparse CNN weights and enhances the spatio-temporal parallelism of fine-grained sparse CNNs. First, a sparse multiplier unit (SMU) array and an adder array are designed to execute the multiplication and accumulation operations of convolution in parallel. Then, element-wise unroll-based nonzero weight multiplication is mapped to the SMU array to provide more flexible spatial parallelism. A horizontal and vertical cross-mesh NoC is proposed for flexible dataflow scheduling between the SMU and adder arrays, further improving temporal parallelism. This architecture allows the multiplication and accumulation operations in convolution to be decoupled and pipelined with negligible latency. Finally, the proposed accelerator architecture is implemented on the ZU9EG platform. The experimental results show that the proposed accelerator achieves 509.9, 249.3, 100.7, 48.4, and 168.9 frames per second (FPS) for AlexNet, VGG-16, ResNet-18, MobileNet-v2, and EfficientNet, respectively. Compared with related works, this accelerator improves inference speed by $1.1\times \sim 36.1\times$ and energy efficiency by $2.4\times \sim 13.4\times$.
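The decoupling of multiplication and accumulation described in the abstract can be illustrated in software. The following is a minimal sketch, not the authors' hardware design: it assumes a single-channel, stride-1, unpadded 2-D convolution (cross-correlation, as is usual in CNNs), and all function names (`nonzero_weights`, `multiply_phase`, `accumulate_phase`, `dense_conv2d`) are hypothetical names chosen for illustration. The multiply phase generates one partial product per nonzero weight and output position, loosely mirroring the element-wise unrolled mapping to the SMU array; the accumulate phase then sums the products per output position, loosely mirroring the adder array. The NoC scheduling and pipelining are not modeled.

```python
def nonzero_weights(kernel):
    """Return (row, col, value) for every nonzero weight of a 2-D kernel."""
    return [(r, c, w)
            for r, row in enumerate(kernel)
            for c, w in enumerate(row)
            if w != 0]


def multiply_phase(feature_map, kernel):
    """Multiply stage: one partial product per (nonzero weight, output position).

    Zero weights generate no work at all. Returns the tagged products and
    the output shape (height, width).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    products = []
    for r, c, w in nonzero_weights(kernel):
        for y in range(out_h):
            for x in range(out_w):
                # Tag each product with the output position it belongs to.
                products.append(((y, x), w * feature_map[y + r][x + c]))
    return products, (out_h, out_w)


def accumulate_phase(products, out_shape):
    """Accumulate stage: sum partial products by their output position."""
    out_h, out_w = out_shape
    out = [[0] * out_w for _ in range(out_h)]
    for (y, x), value in products:
        out[y][x] += value
    return out


def dense_conv2d(feature_map, kernel):
    """Dense reference that also multiplies by zero weights, for checking."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    return [[sum(kernel[r][c] * feature_map[y + r][x + c]
                 for r in range(kh) for c in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


# Illustrative data only: a 4x4 feature map and a 3x3 kernel with 2 nonzero weights.
feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
kernel = [[0, 0, 2],
          [0, -1, 0],
          [0, 0, 0]]

products, out_shape = multiply_phase(feature_map, kernel)
sparse_result = accumulate_phase(products, out_shape)
dense_result = dense_conv2d(feature_map, kernel)
```

In this sketch the number of multiplications is the number of nonzero weights times the number of output positions, so the work scales with the weight density rather than the kernel size. Because the two phases communicate only through the tagged product list, they could in principle run as separate pipeline stages, which is the property the abstract attributes to the SMU and adder arrays.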

Research on CNN Parallel Computing and Learning Architecture Based on Real-Time Streaming Architecture

A Structure–time Parallel Implementation of Spike-Based Deep Learning

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

A High Efficient Architecture for Convolution Neural Network Accelerator

A Parallel Loading Based Accelerator for Convolution Neural Network

Recent Advances in Convolutional Neural Network Acceleration

Parallelizing Convolutional Neural Networks On Intel (R) Many Integrated Core Architecture

Eyelet: A Cross-Mesh NoC-Based Fine-Grained Sparse CNN Accelerator for Spatio-Temporal Parallel Computing Optimization

A High-Performance Systolic Array Accelerator Dedicated for CNN

Smilodon: an Efficient Accelerator for Low Bit-Width CNNs with Task Partitioning

Parallel Convolutional Neural Network (CNN) Accelerators Based on Stochastic Computing

Iteration Time Prediction for CNN in Multi-GPU Platform: Modeling and Analysis

Efficient Scheduling of Irregular Network Structures on CNN Accelerators

Layer-Wise Mixed-Modes CNN Processing Architecture With Double-Stationary Dataflow and Dimension-Reshape Strategy

A Low-Power Hardware Architecture for Real-Time CNN Computing

Real-Time CNN Training and Compression for Neural-Enhanced Adaptive Live Streaming

Efficient Hardware Architectures for Deep Convolutional Neural Network

A Streaming Accelerator for Deep Convolutional Neural Networks with Image and Feature Decomposition for Resource-limited System Applications

Researching the CNN Collaborative Inference Mechanism for Heterogeneous Edge Devices