Abstract: Convolutional neural networks (CNNs), one of today's main flavors of deep learning, dominate a wide range of image recognition tasks. As the model size of modern CNNs continues to grow, neural network compression techniques have been proposed to prune redundant neurons and synapses. However, prior techniques decouple software neural network compression from hardware acceleration and therefore fail to balance multiple design parameters, including sparsity, performance, hardware area cost, and efficiency. More concretely, prior unstructured pruning techniques achieve high sparsity at the expense of extra performance overhead, while prior structured pruning techniques rely on strict sparse patterns, which leads to low sparsity and extra hardware cost. In this article, we propose OMNI, a framework for accelerating sparse CNNs on hardware accelerators. The innovation of OMNI is that it uses hardware-amenable on-chip memory partition patterns to seamlessly link software CNN model compression with hardware CNN acceleration. A promising hardware optimization for accelerating the compute-intensive convolution kernel is memory partitioning, which divides the original weight kernels into several groups so that different hardware processing elements can access the weights simultaneously. We exploit memory partition patterns, including block, cyclic, and hybrid, as CNN compression patterns. Our software CNN model compression balances the sparsity across different groups, and our hardware accelerator employs hardware parallelization in coordination with the sparse patterns, leading to a desirable compromise between sparsity and performance. We further develop performance models to help designers quickly identify the pattern factors subject to an area constraint. Finally, we evaluate our design on application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) platforms.
Experiments demonstrate that OMNI achieves a $3.4\times$ to $6.2\times$ speedup for modern CNNs over a comparably ideal dense CNN accelerator. OMNI shows a $114.7\times$ energy efficiency improvement compared with a GPU platform. OMNI is also evaluated on Xilinx ZC706 and ZCU102 FPGA platforms, achieving 41.5 GOP/s and 125.3 GOP/s, respectively.
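
The idea of using memory partition patterns as compression patterns can be illustrated with a minimal sketch. It is not the OMNI implementation: the function names `partition_indices` and `balanced_prune`, the flattened weight list, and the magnitude-based selection rule are all assumptions made for illustration, and the hybrid pattern is omitted. The sketch assigns each weight to a memory bank under a block or cyclic scheme, then keeps the same number of nonzero weights in every bank so that parallel processing elements would see equal work.

```python
def partition_indices(n, factor, scheme):
    """Map each of n weight positions to one of `factor` memory banks.

    Hypothetical helper: "block" places contiguous chunks in a bank,
    "cyclic" interleaves positions across banks.
    """
    if scheme == "block":
        size = -(-n // factor)  # ceiling division: positions per bank
        return [i // size for i in range(n)]
    if scheme == "cyclic":
        return [i % factor for i in range(n)]
    raise ValueError(f"unknown scheme: {scheme}")


def balanced_prune(weights, factor, keep_per_bank, scheme="cyclic"):
    """Keep the `keep_per_bank` largest-magnitude weights in every bank.

    Assumption: magnitude pruning stands in for whatever criterion the
    actual compression flow uses. All other weights are set to zero.
    """
    banks = partition_indices(len(weights), factor, scheme)
    pruned = [0.0] * len(weights)
    for b in range(factor):
        members = [i for i, bank in enumerate(banks) if bank == b]
        members.sort(key=lambda i: abs(weights[i]), reverse=True)
        for i in members[:keep_per_bank]:
            pruned[i] = weights[i]
    return pruned


weights = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.3, -0.8]
cyclic_result = balanced_prune(weights, factor=2, keep_per_bank=2, scheme="cyclic")
block_result = balanced_prune(weights, factor=2, keep_per_bank=2, scheme="block")
```

Both schemes leave exactly two nonzero weights per bank here, but they keep different weights, because the grouping determines which weights compete with each other for the retained slots.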

Support Convolution of CNN with Compression Sparse Matrix Multiplication Flow in TVM

Deep Neural Network Acceleration with Sparse Prediction Layers

MLCNN: Cross-Layer Cooperative Optimization and Accelerator Architecture for Speeding Up Deep Learning Applications

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training

Exploiting Sparsity to Accelerate Fully Connected Layers of CNN-Based Applications on Mobile SoCs

Accelerating Sparse CNN Inference on GPUs with Performance-Aware Weight Pruning

Efficient Network Compression Through Smooth-Lasso Constraint

Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

A Pruning Method Based on the Dissimilarity of Angle among Channels and Filters

Compressing CNNs Using Multilevel Filter Pruning for the Edge Nodes of Multimedia Internet of Things

Iterative Deep Model Compression and Acceleration in the Frequency Domain

OMNI: A Framework for Integrating Hardware and Software Optimizations for Sparse CNNs

VSCNN: Convolution Neural Network Accelerator With Vector Sparsity

SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity

ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition

Prune the Convolutional Neural Networks with Sparse Shrink

SpWMM: A High-Performance Sparse-Winograd Matrix-Matrix Multiplication Accelerator for CNNs

Accelerating convolutional neural network by exploiting sparsity on GPUs

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration