Abstract: Fine-grained sparse convolutional neural networks (CNNs) achieve a better trade-off between model accuracy and size than coarse-grained sparse CNNs. However, because of their irregular data structures and unbalanced computation loads, fine-grained sparse CNNs struggle to fully realize their computation and storage advantages on general-purpose edge hardware. Existing custom sparse accelerators mainly emulate a balanced load through software or computational strategies, and they neglect the adaptability and parallelism of the computing architecture itself for fine-grained sparse models. To address these challenges, a cross-mesh NoC-based accelerator architecture is proposed. This architecture matches the irregular structure of fine-grained sparse CNN weights and enhances the spatio-temporal parallelism of fine-grained sparse CNNs. First, a sparse multiplier unit (SMU) array and an adder array are designed to execute the multiplication and accumulation operations of convolution in parallel. Then, element-wise unroll-based nonzero weight multiplication is mapped to the SMU array to provide more flexible spatial parallelism. A horizontal and vertical cross-mesh NoC is proposed for flexible dataflow scheduling between the SMU and adder arrays, which further improves temporal parallelism. This architecture allows the multiplication and accumulation operations in convolution to be decoupled and pipelined with negligible latency. Finally, the proposed accelerator architecture is implemented on the ZU9EG platform. The experimental results show that the proposed accelerator achieves 509.9, 249.3, 100.7, 48.4, and 168.9 frames per second (FPS) for AlexNet, VGG-16, ResNet-18, MobileNet-v2, and EfficientNet, respectively. Compared with related works, this accelerator improves inference speed by $1.1\times$ to $36.1\times$ and energy efficiency by $2.4\times$ to $13.4\times$.
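The decoupling of multiplication and accumulation described in the abstract can be illustrated with a small software sketch. This is not the accelerator's implementation: it is a minimal single-channel, stride-1, unpadded model, and the function names (`nonzero_weights`, `multiply_stage`, `accumulate_stage`) are hypothetical. It assumes that each nonzero weight is handled independently in a multiplier stage (standing in for the SMU array) and that the resulting products are routed to per-output accumulators in a separate stage (standing in for the adder array).

```python
from typing import List, Tuple

Matrix = List[List[float]]
Product = Tuple[int, int, float]  # (output row, output col, partial product)


def nonzero_weights(kernel: Matrix) -> List[Tuple[int, int, float]]:
    """Compress a 2-D kernel into (row, col, value) triples, skipping zeros."""
    return [(r, c, w)
            for r, row in enumerate(kernel)
            for c, w in enumerate(row)
            if w != 0]


def multiply_stage(image: Matrix, weights: List[Tuple[int, int, float]],
                   out_h: int, out_w: int) -> List[Product]:
    """Multiplier stage: one product per (nonzero weight, output position).

    Zero weights never reach this stage, so no multiplication is wasted.
    """
    products = []
    for r, c, w in weights:          # element-wise unroll over nonzero weights
        for i in range(out_h):
            for j in range(out_w):
                products.append((i, j, w * image[i + r][j + c]))
    return products


def accumulate_stage(products: List[Product], out_h: int, out_w: int) -> Matrix:
    """Adder stage: route each product to the accumulator of its output pixel."""
    out = [[0.0] * out_w for _ in range(out_h)]
    for i, j, p in products:
        out[i][j] += p
    return out


def sparse_conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Sparse convolution (cross-correlation form) built from the two stages."""
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    weights = nonzero_weights(kernel)
    products = multiply_stage(image, weights, out_h, out_w)
    return accumulate_stage(products, out_h, out_w)


def dense_conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Dense reference that multiplies every weight, including zeros."""
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    return [[float(sum(kernel[r][c] * image[i + r][j + c]
                       for r in range(len(kernel))
                       for c in range(len(kernel[0]))))
             for j in range(out_w)]
            for i in range(out_h)]
```

In this sketch the two stages communicate only through the list of tagged products, so a hardware analogue could pipeline them; the list plays the role that the abstract assigns to the cross-mesh NoC, although the actual routing and scheduling are not modeled here.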

ARA: Cross-Layer Approximate Computing Framework Based Reconfigurable Architecture for CNNs

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

The Storage Structure of Convolutional Neural Network Reconfigurable Accelerator Based on ASIC

A Precision-Scalable Energy-Efficient Convolutional Neural Network Accelerator

Layer-Wise Mixed-Modes CNN Processing Architecture With Double-Stationary Dataflow and Dimension-Reshape Strategy

A Weight-Reload-Eliminated Compute-in-Memory Accelerator for 60 fps 4K Super-Resolution

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

A High-Efficient and Configurable Hardware Accelerator for Convolutional Neural Network

WRA: A 2.2-to-6.3 TOPS Highly Unified Dynamically Reconfigurable Accelerator Using a Novel Winograd Decomposition Algorithm for Convolutional Neural Networks

A Reconfigurable Accelerator for Sparse Convolutional Neural Networks

AxR-NN: Approximate Computation Reuse for Energy-Efficient Convolutional Neural Networks

USCA: A Unified Systolic Convolution Array Architecture for Accelerating Sparse Neural Network

A Sparse CNN Accelerator for Eliminating Redundant Computations in Intra- and Inter-Convolutional/Pooling Layers

A Low-Power Hardware Architecture for Real-Time CNN Computing

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

Eyelet: A Cross-Mesh NoC-Based Fine-Grained Sparse CNN Accelerator for Spatio-Temporal Parallel Computing Optimization

A Reconfigurable Spatial Architecture for Energy-Efficient Inception Neural Networks

LACS: A High-Computational-Efficiency Accelerator for CNNs

Approximate Processing Element Design and Analysis for the Implementation of CNN Accelerators

A High-Performance Reconfigurable Accelerator for Convolutional Neural Networks

A Mixed-Pruning Based Framework for Embedded Convolutional Neural Network Acceleration