Abstract: Convolutional neural networks (CNNs) have been widely used in image classification and recognition due to their effectiveness; however, CNNs use a large volume of weight data that is difficult to store in the on-chip memory of embedded designs. Pruning can compress a CNN model at a small accuracy loss; however, a pruned CNN model runs slower when implemented on a parallel architecture. In this paper, a hardware-oriented CNN compression strategy is proposed, in which a deep neural network (DNN) model is divided into "no-pruning layers ($NP$-layers)" and "pruning layers ($P$-layers)". An $NP$-layer has a regular weight distribution suited to parallel computing and high performance. A $P$-layer is irregular due to pruning, but it yields a high compression ratio. Uniform and incremental quantization schemes are used to achieve a tradeoff between compression ratio and processing efficiency at a small loss in accuracy. A distributed convolutional architecture with several parallel finite impulse response (FIR) filters is further proposed for the regular model in the $NP$-layers. A shift-accumulator based processing element with an activation-driven data flow (ADF) is proposed for the irregular sparse model in the $P$-layers. Based on the proposed compression strategy and hardware architecture, a hardware/algorithm co-optimization (HACO) approach is proposed for implementing an $NP$-$P$ hybrid compressed CNN model on FPGAs. For a hardware accelerator on a single FPGA chip without the use of off-chip memory, a $27.5\times$ compression ratio is achieved with a 0.44% top-5 accuracy loss for VGG-16. The implementation of the compressed VGG-16 model on a Xilinx VCU118 evaluation board processes 83.0 frames per second (FPS) for image applications, which is $1.8\times$ higher than the state-of-the-art design found in the technical literature.
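
The idea behind a shift-accumulator processing element for a pruned layer can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that quantization maps each surviving weight to a signed power of two, so a multiplication becomes a bit shift, and that "activation-driven" means the loop walks over nonzero activations and skips zeros. The names `MIN_EXP`, `quantize_pow2`, `compress_layer`, and `shift_accumulate` are hypothetical and chosen for illustration only.

```python
import math

# Assumption: smallest exponent kept; weights that round below 2**MIN_EXP are pruned.
MIN_EXP = -6


def quantize_pow2(w):
    """Round w to a signed power of two; return (sign, exp) or None if pruned."""
    if w == 0.0:
        return None
    exp = round(math.log2(abs(w)))  # nearest exponent in the log domain
    if exp < MIN_EXP:
        return None                 # too small: treated as a pruned weight
    exp = min(exp, 0)               # clamp magnitudes to at most 1.0
    return (1 if w > 0 else -1, exp)


def compress_layer(weights):
    """weights[i][o] is the dense weight from input i to output o.

    Returns, for each input i, a list of (o, sign, exp) for surviving weights.
    """
    sparse = []
    for row in weights:
        entries = []
        for o, w in enumerate(row):
            q = quantize_pow2(w)
            if q is not None:
                entries.append((o, q[0], q[1]))
        sparse.append(entries)
    return sparse


def shift_accumulate(activations, sparse, n_out):
    """Integer activations in; accumulators scaled by 2**(-MIN_EXP) out."""
    acc = [0] * n_out
    for i, a in enumerate(activations):
        if a == 0:
            continue                # zero activation: no work is issued
        for o, sign, exp in sparse[i]:
            # a * 2**exp, kept exact by pre-scaling with 2**(-MIN_EXP)
            acc[o] += sign * (a << (exp - MIN_EXP))
    return acc


W = [[0.5, -0.26, 0.001],
     [1.0, 0.0, 0.12]]
sparse_W = compress_layer(W)
scaled = shift_accumulate([4, 8], sparse_W, 3)
outputs = [v / (1 << -MIN_EXP) for v in scaled]
```

In this toy layer, `0.001` and `0.0` are dropped, `-0.26` becomes `-2**-2`, and `0.12` becomes `2**-3`, so `outputs` evaluates to `[10.0, -1.0, 1.0]` without any multiplier in the inner loop.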

Efficient Binary 3D Convolutional Neural Network and Hardware Accelerator

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

Energy-Efficient Architecture for FPGA-based Deep Convolutional Neural Networks with Binary Weights

A Scalable 3D Array Architecture for Accelerating Convolutional Neural Networks

A High-Efficient and Configurable Hardware Accelerator for Convolutional Neural Network

A High Efficient Architecture for Convolution Neural Network Accelerator

Neural Synaptic Plasticity-Inspired Computing: A High Computing Efficient Deep Convolutional Neural Network Accelerator

A 3D Tiled Low Power Accelerator for Convolutional Neural Network

VWA: Hardware Efficient Vectorwise Accelerator for Convolutional Neural Network

High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network

Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering

A Survey on Efficient Convolutional Neural Networks and Hardware Acceleration

A Parallel Loading Based Accelerator for Convolution Neural Network

Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights

YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights

An Efficient Kernel Transformation Architecture for Binary- and Ternary-Weight Neural Network Inference

A High-Performance Pixel-Level Fully Pipelined Hardware Accelerator for Neural Networks

Area and Energy Efficient 2D Max-Pooling for Convolutional Neural Network Hardware Accelerator