Abstract: Convolutional neural networks (CNNs), one of today's main flavors of deep learning, dominate various image recognition tasks. As the model size of modern CNNs continues to grow, neural network compression techniques have been proposed to prune redundant neurons and synapses. However, prior techniques decouple software neural network compression from hardware acceleration and therefore fail to balance multiple design parameters, including sparsity, performance, hardware area cost, and efficiency. More concretely, prior unstructured pruning techniques achieve high sparsity at the expense of extra performance overhead, while prior structured pruning techniques, which rely on strict sparse patterns, lead to low sparsity and extra hardware cost. In this article, we propose OMNI, a framework for accelerating sparse CNNs on hardware accelerators. OMNI's innovation is its use of hardware-amenable on-chip memory partition patterns to seamlessly connect software CNN model compression with hardware CNN acceleration. A promising hardware optimization for accelerating the compute-intensive convolution kernel is memory partition, which divides the original weight kernels into several groups so that different hardware processing elements can access the weights simultaneously. We exploit memory partition patterns, including block, cyclic, and hybrid, as CNN compression patterns. Our software CNN model compression balances the sparsity across different groups, and our hardware accelerator parallelizes computation in coordination with the sparse patterns, leading to a desirable compromise between sparsity and performance. We further develop performance models that help designers quickly identify the pattern factors subject to an area constraint. Finally, we evaluate our design on application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) platforms.
Experiments demonstrate that OMNI achieves a $3.4\times$ to $6.2\times$ speedup for modern CNNs over a comparably ideal dense CNN accelerator. OMNI shows a $114.7\times$ energy efficiency improvement compared with a GPU platform. OMNI is also evaluated on the Xilinx ZC706 and ZCU102 FPGA platforms, achieving 41.5 GOP/s and 125.3 GOP/s, respectively.
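
To make the idea of partition patterns doubling as compression patterns concrete, the following is a minimal sketch and not OMNI's actual implementation. It assumes a flattened weight kernel, a partition factor of 2, and magnitude-based pruning that keeps the same number of weights in every group; the abstract does not state the pruning criterion, and the names `block_groups`, `cyclic_groups`, and `balanced_prune` are hypothetical.

```python
def block_groups(n, factor):
    """Block partition: consecutive indices are placed in the same group (memory bank)."""
    size = -(-n // factor)  # ceiling division
    return [list(range(g * size, min((g + 1) * size, n))) for g in range(factor)]


def cyclic_groups(n, factor):
    """Cyclic partition: indices are interleaved across groups (index i goes to group i % factor)."""
    return [list(range(g, n, factor)) for g in range(factor)]


def balanced_prune(weights, groups, keep_per_group):
    """Keep the keep_per_group largest-magnitude weights in each group and zero the rest.

    Assumption: magnitude is used as the importance measure purely for illustration.
    """
    pruned = [0.0] * len(weights)
    for indices in groups:
        kept = sorted(indices, key=lambda i: abs(weights[i]), reverse=True)[:keep_per_group]
        for i in kept:
            pruned[i] = weights[i]
    return pruned


# Hypothetical flattened kernel with 8 weights, split into 2 groups.
weights = [0.9, 0.8, 0.05, 0.7, -0.1, 0.2, 0.3, -0.6]

blk = block_groups(len(weights), 2)    # [[0, 1, 2, 3], [4, 5, 6, 7]]
cyc = cyclic_groups(len(weights), 2)   # [[0, 2, 4, 6], [1, 3, 5, 7]]

pruned_block = balanced_prune(weights, blk, keep_per_group=2)
pruned_cyclic = balanced_prune(weights, cyc, keep_per_group=2)

print(pruned_block)   # [0.9, 0.8, 0.0, 0.0, 0.0, 0.0, 0.3, -0.6]
print(pruned_cyclic)  # [0.9, 0.8, 0.0, 0.7, 0.0, 0.0, 0.3, 0.0]
```

In both cases every group retains exactly two nonzero weights, so processing elements that each read one group would have equal work. Which weights survive depends on the pattern, which is why the choice of block, cyclic, or hybrid partitioning affects both accuracy and hardware parallelism.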

OMNI: A Framework for Integrating Hardware and Software Optimizations for Sparse CNNs