Abstract: Convolutional neural networks (CNNs), one of today's main flavors of deep learning, dominate a variety of image recognition tasks. As the model size of modern CNNs continues to grow, neural network compression techniques have been proposed to prune redundant neurons and synapses. However, prior techniques decouple software neural network compression from hardware acceleration and therefore fail to balance multiple design parameters, including sparsity, performance, hardware area cost, and efficiency. More concretely, prior unstructured pruning techniques achieve high sparsity at the expense of extra performance overhead, while prior structured pruning techniques, which rely on strict sparse patterns, lead to low sparsity and extra hardware cost. In this article, we propose OMNI, a framework for accelerating sparse CNNs on hardware accelerators. The innovation of OMNI is that it uses hardware-amenable on-chip memory partition patterns to seamlessly couple software CNN model compression with hardware CNN acceleration. A promising hardware optimization for accelerating the compute-intensive convolution kernel is memory partitioning, which divides the original weight kernels into several groups so that different hardware processing elements can access the weights simultaneously. We exploit memory partition patterns, including block, cyclic, and hybrid, as CNN compression patterns. Our software CNN model compression balances sparsity across the groups, and our hardware accelerator applies hardware parallelization in coordination with the sparse patterns, leading to a desirable compromise between sparsity and performance. We further develop performance models that help designers quickly identify the pattern factors subject to an area constraint. Finally, we evaluate our design on application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) platforms.
Experiments demonstrate that OMNI achieves a $3.4\times$–$6.2\times$ speedup on modern CNNs over a comparably ideal dense CNN accelerator. OMNI shows a $114.7\times$ energy efficiency improvement compared with a GPU platform. OMNI is also evaluated on the Xilinx ZC706 and ZCU102 FPGA platforms, achieving 41.5 GOP/s and 125.3 GOP/s, respectively.
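To make the idea of partition patterns doubling as compression patterns more concrete, the following is a minimal sketch and not the OMNI implementation. It assumes a flattened weight kernel, a number of memory banks that divides the kernel length evenly, and magnitude-based pruning that keeps the same number of weights in every bank. The function names `partition_indices` and `balanced_prune` are hypothetical, and the hybrid pattern is omitted because the abstract does not define it.

```python
def partition_indices(n, groups, pattern="block"):
    """Map each of n weight indices to a memory bank (group) id.

    block:  consecutive indices share a bank.
    cyclic: indices are dealt to banks round-robin.
    """
    if n % groups != 0:
        raise ValueError("this sketch assumes n is divisible by groups")
    size = n // groups
    if pattern == "block":
        return [i // size for i in range(n)]
    if pattern == "cyclic":
        return [i % groups for i in range(n)]
    raise ValueError(f"unsupported pattern: {pattern}")


def balanced_prune(weights, groups, keep_per_group, pattern="block"):
    """Keep the keep_per_group largest-magnitude weights in every bank.

    Every bank ends up with the same number of nonzeros, so processing
    elements reading different banks would have equal work (assumption:
    one processing element per bank).
    """
    bank = partition_indices(len(weights), groups, pattern)
    pruned = [0.0] * len(weights)
    for g in range(groups):
        members = [i for i in range(len(weights)) if bank[i] == g]
        keep = sorted(members, key=lambda i: abs(weights[i]), reverse=True)
        for i in keep[:keep_per_group]:
            pruned[i] = weights[i]
    return pruned


if __name__ == "__main__":
    w = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.3, -0.8]
    print(balanced_prune(w, groups=2, keep_per_group=2, pattern="block"))
    print(balanced_prune(w, groups=2, keep_per_group=2, pattern="cyclic"))
```

With the same weights and the same overall sparsity, the two patterns retain different weights, which illustrates why the choice of pattern and its factors could be treated as a design parameter alongside hardware parallelism.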

OMNI: A Framework for Integrating Hardware and Software Optimizations for Sparse CNNs
