Abstract: Convolutional neural networks (CNNs), one of today’s main flavors of deep learning, dominate various image recognition tasks. As the model size of modern CNNs continues to grow, neural network compression techniques have been proposed to prune redundant neurons and synapses. However, prior techniques decouple software neural network compression from hardware acceleration and therefore fail to balance multiple design parameters, including sparsity, performance, hardware area cost, and efficiency. More concretely, prior unstructured pruning techniques achieve high sparsity at the expense of extra performance overhead, while prior structured pruning techniques, which rely on strict sparse patterns, lead to low sparsity and extra hardware cost. In this article, we propose OMNI, a framework for accelerating sparse CNNs on hardware accelerators. The innovation of OMNI is that it uses hardware-amenable on-chip memory partition patterns to seamlessly link software CNN model compression with hardware CNN acceleration. To accelerate the compute-intensive convolution kernel, a promising hardware optimization approach is memory partitioning, which divides the original weight kernels into several groups so that different hardware processing elements can access the weights simultaneously. We exploit memory partition patterns, including block, cyclic, and hybrid, as CNN compression patterns. Our software CNN model compression balances the sparsity across different groups, and our hardware accelerator parallelizes computation in coordination with the sparse patterns, leading to a desirable compromise between sparsity and performance. We further develop performance models to help designers quickly identify the pattern factors subject to an area constraint. Finally, we evaluate our design on application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) platforms.
Experiments demonstrate that OMNI achieves a $3.4\times$–$6.2\times$ speedup for modern CNNs over a comparably ideal dense CNN accelerator. OMNI shows a $114.7\times$ energy efficiency improvement compared with a GPU platform. OMNI is also evaluated on Xilinx ZC706 and ZCU102 FPGA platforms, achieving 41.5 GOP/s and 125.3 GOP/s, respectively.
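
The idea of using memory partition patterns as pruning patterns can be illustrated with a minimal sketch. It is not OMNI's implementation: the function names (`partition_index`, `balanced_prune`, `nonzeros_per_group`), the magnitude-based selection rule, and the flat weight layout are all assumptions made for illustration, and the hybrid pattern is omitted because the abstract does not define it. The sketch assigns each weight to a group (memory bank) by a block or cyclic rule and keeps the same number of nonzeros in every group, so that parallel processing elements would see equal work.

```python
def partition_index(i, n, factor, pattern):
    """Map flat weight index i (0 <= i < n) to one of `factor` groups.

    'cyclic' interleaves consecutive weights across groups;
    'block' places contiguous chunks of weights in each group.
    """
    if pattern == "cyclic":
        return i % factor
    if pattern == "block":
        block_size = -(-n // factor)  # ceil(n / factor)
        return i // block_size
    raise ValueError(f"unknown pattern: {pattern}")


def balanced_prune(weights, factor, pattern, keep_per_group):
    """Zero all but the `keep_per_group` largest-magnitude weights per group.

    Magnitude-based selection is an assumption of this sketch, not a
    claim about how OMNI chooses which weights to keep.
    """
    n = len(weights)
    groups = [[] for _ in range(factor)]
    for i in range(n):
        groups[partition_index(i, n, factor, pattern)].append(i)

    pruned = [0.0] * n
    for idxs in groups:
        kept = sorted(idxs, key=lambda i: abs(weights[i]), reverse=True)
        for i in kept[:keep_per_group]:
            pruned[i] = weights[i]
    return pruned


def nonzeros_per_group(weights, factor, pattern):
    """Count nonzero weights in each group, to check the balance."""
    counts = [0] * factor
    for i, w in enumerate(weights):
        if w != 0.0:
            counts[partition_index(i, len(weights), factor, pattern)] += 1
    return counts


if __name__ == "__main__":
    w = [0.9, -0.1, 0.8, 0.2, -0.7, 0.05, 0.3, -0.6]
    for pattern in ("cyclic", "block"):
        p = balanced_prune(w, factor=2, pattern=pattern, keep_per_group=2)
        print(pattern, p, nonzeros_per_group(p, 2, pattern))
```

With the same weights and the same per-group budget, the two patterns keep different weights: the cyclic pattern compares weights at alternating positions, while the block pattern compares neighbors within a contiguous chunk. In both cases every group ends up with the same nonzero count, which is the balance property the abstract describes.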

A Balanced Sparse Matrix Convolution Accelerator for Efficient CNN Training

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability.

An Efficient CNN Training Accelerator Leveraging Transposable Block Sparsity

An Efficient and Flexible Accelerator Design for Sparse Convolutional Neural Networks

An Efficient CNN Accelerator for Pattern-Compressed Sparse Neural Networks on FPGA

OMNI: A Framework for Integrating Hardware and Software Optimizations for Sparse CNNs

ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition

An Efficient Sparse CNNs Accelerator on FPGA

A Reconfigurable Accelerator for Sparse Convolutional Neural Networks.

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

Eyelet: A Cross-Mesh NoC-Based Fine-Grained Sparse CNN Accelerator for Spatio-Temporal Parallel Computing Optimization

SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training

Accelerator for Sparse Convolutional Neural Networks Based on Shift Units

A High-performance Inference Accelerator Exploiting Patterned Sparsity in CNNs

Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration

High PE Utilization CNN Accelerator with Channel Fusion Supporting Pattern-Compressed Sparse Neural Networks

A Low-Power Sparse Convolutional Neural Network Accelerator with Pre-Encoding Radix-4 Booth Multiplier

An Efficient Hardware Design for Accelerating Sparse CNNs With NAS-Based Models

LACS: A High-Computational-Efficiency Accelerator for CNNs

Crane: Mitigating Accelerator Under-utilization Caused by Sparsity Irregularities in CNNs