Abstract: Convolutional neural networks (CNNs) are widely used in image classification and recognition because of their effectiveness; however, they rely on a large volume of weight data that is difficult to store in the on-chip memory of embedded designs. Pruning can compress a CNN model at a small loss in accuracy, but a pruned CNN model runs slower when implemented on a parallel architecture. In this paper, a hardware-oriented CNN compression strategy is proposed in which a deep neural network (DNN) model is divided into "no-pruning layers" ($NP$-layers) and "pruning layers" ($P$-layers). An $NP$-layer has a regular weight distribution suited to parallel computing and high performance. A $P$-layer is irregular due to pruning, but it yields a high compression ratio. Uniform and incremental quantization schemes are used to achieve a tradeoff between compression ratio and processing efficiency at a small loss in accuracy. A distributed convolutional architecture with several parallel finite impulse response (FIR) filters is further proposed for the regular model in the $NP$-layers. A shift-accumulator-based processing element with an activation-driven data flow (ADF) is proposed for the irregular sparse model in the $P$-layers. Based on the proposed compression strategy and hardware architecture, a hardware/algorithm co-optimization (HACO) approach is proposed for implementing an $NP$-$P$ hybrid compressed CNN model on FPGAs. For a hardware accelerator on a single FPGA chip without the use of off-chip memory, a $27.5\times$ compression ratio is achieved with a 0.44% top-5 accuracy loss for VGG-16. The implementation of the compressed VGG-16 model on a Xilinx VCU118 evaluation board processes 83.0 frames per second (FPS) for image applications, a $1.8\times$ improvement over the state-of-the-art design found in the technical literature.
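
To make the $P$-layer idea concrete, the following is a minimal Python sketch of a pruned layer whose surviving weights are rounded to signed powers of two, so that each multiplication becomes a shift followed by an accumulation, and whose computation is driven by the nonzero activations. It is an illustration under stated assumptions, not the implementation described in the paper: the helper names (`prune`, `quantize_pow2`, `to_sparse_columns`, `adf_shift_accumulate`), the magnitude threshold, the 4-bit fixed-point fraction, and the fully connected layer shape are all hypothetical choices made for the example.

```python
import math

FRAC_BITS = 4  # assumed fixed-point fraction width; smallest weight is 2**-4


def prune(weights, threshold):
    """Magnitude pruning (assumed criterion): zero weights below threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def quantize_pow2(w, max_exp=3):
    """Round a nonzero weight to sign * 2**exp, with exp clamped to a small range."""
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(-FRAC_BITS, min(max_exp, exp))
    return sign, exp


def to_sparse_columns(weights):
    """Store only nonzero weights, grouped by the input activation they use."""
    columns = [[] for _ in range(len(weights[0]))]
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            if w != 0.0:
                sign, exp = quantize_pow2(w)
                columns[j].append((i, sign, exp))
    return columns


def adf_shift_accumulate(columns, activations, n_out):
    """Activation-driven pass: skip zero activations, then shift and accumulate."""
    acc = [0] * n_out
    for j, a in enumerate(activations):
        if a == 0:
            continue  # a zero activation triggers no work at all
        for i, sign, exp in columns[j]:
            # exp + FRAC_BITS >= 0, so the multiply is a plain left shift
            acc[i] += sign * (a << (exp + FRAC_BITS))
    return [v / (1 << FRAC_BITS) for v in acc]


# Toy 2x3 layer with integer activations (values chosen for illustration only)
weights = [[0.5, 0.01, -2.0], [0.03, 0.25, 1.0]]
columns = to_sparse_columns(prune(weights, threshold=0.1))
outputs = adf_shift_accumulate(columns, [4, 0, 3], n_out=2)
print(outputs)
```

Grouping the stored weights by input index is what lets a zero activation be skipped entirely, and restricting weights to powers of two is what allows a shifter to stand in for a multiplier; both are simplifications chosen here to mirror the shift-accumulator and ADF ideas named in the abstract.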

Quantized Guided Pruning for Efficient Hardware Implementations of Convolutional Neural Networks

Single-shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

Structured Pruning for Efficient Convolutional Neural Networks Via Incremental Regularization

A Hardware-Friendly High-Precision CNN Pruning Method and Its FPGA Implementation

Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs

A Pruning Method Based on the Dissimilarity of Angle among Channels and Filters

Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks

HAPM -- Hardware Aware Pruning Method for CNN hardware accelerators in resource constrained devices

Differentiable Joint Pruning and Quantization for Hardware Efficiency

High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization

Quantisation and Pruning for Neural Network Compression and Regularisation

Learning Low Resource Consumption CNN through Pruning and Quantization

Hardware-Aware Evolutionary Explainable Filter Pruning for Convolutional Neural Networks

Conv-inheritance: A hardware-efficient method to compress convolutional neural networks for edge applications

Iterative Filter Pruning for Concatenation-based CNN Architectures

Pruning and quantization for deep neural network acceleration: A survey

Quantization-Based Optimization Algorithm for Hardware Implementation of Convolution Neural Networks

Frequency-Domain Dynamic Pruning for Convolutional Neural Networks

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective