Abstract: Convolutional neural networks (CNNs) are widely used in image classification and recognition because of their effectiveness; however, CNNs require a large volume of weight data that is difficult to store in the on-chip memory of embedded designs. Pruning can compress a CNN model at a small loss in accuracy; however, a pruned CNN model runs more slowly when implemented on a parallel architecture. In this paper, a hardware-oriented CNN compression strategy is proposed in which a deep neural network (DNN) model is divided into "no-pruning layers ($NP$-layers)" and "pruning layers ($P$-layers)". An $NP$-layer has a regular weight distribution, which suits parallel computing and yields high performance. A $P$-layer is irregular due to pruning, but it achieves a high compression ratio. Uniform and incremental quantization schemes are used to achieve a tradeoff between compression ratio and processing efficiency at a small loss in accuracy. A distributed convolutional architecture with several parallel finite impulse response (FIR) filters is further proposed for the regular model in the $NP$-layers. A shift-accumulator based processing element with an activation-driven data flow (ADF) is proposed for the irregular sparse model in the $P$-layers. Based on the proposed compression strategy and hardware architecture, a hardware/algorithm co-optimization (HACO) approach is proposed for implementing an $NP$-$P$ hybrid compressed CNN model on FPGAs. For a hardware accelerator on a single FPGA chip without the use of off-chip memory, a $27.5\times$ compression ratio is achieved with a 0.44% top-5 accuracy loss for VGG-16. The implementation of the compressed VGG-16 model on a Xilinx VCU118 evaluation board processes 83.0 frames per second (FPS) for image applications, which is $1.8\times$ higher than the state-of-the-art design found in the technical literature.
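To make the $P$-layer idea concrete, the following is a minimal Python sketch of pruning followed by a shift-accumulate dot product driven by nonzero activations. It is an illustration under stated assumptions, not the design described in the paper: the abstract does not specify the pruning criterion or the weight format, so the sketch assumes simple magnitude pruning and weights rounded to signed powers of two (so that each multiplication becomes a bit shift). All names (`prune_by_magnitude`, `quantize_pow2`, `build_sparse_columns`, `shift_accumulate`, `MIN_EXP`) and the example values are hypothetical.

```python
import math

# Assumption: quantized weights have the form sign * 2**exp with MIN_EXP <= exp <= 0.
MIN_EXP = -7


def prune_by_magnitude(row, keep_ratio):
    """Zero every weight whose magnitude is below the k-th largest in the row."""
    k = max(1, round(len(row) * keep_ratio))
    threshold = sorted((abs(w) for w in row), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in row]


def quantize_pow2(w):
    """Round a nonzero weight to (sign, exp); return None for a pruned weight."""
    if w == 0.0:
        return None
    exp = round(math.log2(abs(w)))
    exp = max(MIN_EXP, min(0, exp))  # clamp to the assumed exponent range
    return (1 if w > 0 else -1, exp)


def build_sparse_columns(weight_rows):
    """Group nonzero weights by input index as (output_index, sign, left_shift)."""
    columns = [[] for _ in range(len(weight_rows[0]))]
    for out_idx, row in enumerate(weight_rows):
        for in_idx, w in enumerate(row):
            q = quantize_pow2(w)
            if q is not None:
                sign, exp = q
                # Shifts are offset by MIN_EXP so they are never negative.
                columns[in_idx].append((out_idx, sign, exp - MIN_EXP))
    return columns


def shift_accumulate(activations, columns, n_out):
    """Integer shift-accumulate; only nonzero activations trigger any work."""
    acc = [0] * n_out
    for in_idx, a in enumerate(activations):
        if a == 0:
            continue  # skip zero activations entirely
        for out_idx, sign, shift in columns[in_idx]:
            acc[out_idx] += sign * (a << shift)
    # Undo the fixed-point offset introduced by MIN_EXP.
    return [v * 2.0 ** MIN_EXP for v in acc]


dense = [[0.5, -0.26, 0.02, 1.0], [0.03, 0.12, -0.9, 0.0]]
pruned = [prune_by_magnitude(r, 0.5) for r in dense]
columns = build_sparse_columns(pruned)
outputs = shift_accumulate([3, 0, 5, 2], columns, n_out=2)
```

In this toy example half of each row is pruned, the zero activation at index 1 is skipped, and every remaining product is computed with a shift and an add, which is the property that makes a multiplier-free processing element plausible for sparse layers.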
