Abstract: Convolutional neural networks (CNNs) have been widely used in image classification and recognition due to their effectiveness; however, CNNs use a large volume of weight data that is difficult to store in the on-chip memory of embedded designs. Pruning can compress a CNN model at a small accuracy loss; however, a pruned CNN model runs slower when implemented on a parallel architecture. In this paper, a hardware-oriented CNN compression strategy is proposed: a deep neural network (DNN) model is divided into "no-pruning layers ($NP$-layers)" and "pruning layers ($P$-layers)". An $NP$-layer has a regular weight distribution suited to parallel computing and high performance. A $P$-layer is irregular due to pruning, but it yields a high compression ratio. Uniform and incremental quantization schemes are used to achieve a tradeoff between compression ratio and processing efficiency at a small loss in accuracy. A distributed convolutional architecture with several parallel finite impulse response (FIR) filters is further proposed for the regular model in the $NP$-layers. A shift-accumulator based processing element with an activation-driven data flow (ADF) is proposed for the irregular sparse model in the $P$-layers. Based on the proposed compression strategy and hardware architecture, a hardware/algorithm co-optimization (HACO) approach is proposed for implementing an $NP$-$P$ hybrid compressed CNN model on FPGAs. For a hardware accelerator on a single FPGA chip without the use of off-chip memory, a $27.5\times$ compression ratio is achieved with a 0.44% top-5 accuracy loss for VGG-16. The implementation of the compressed VGG-16 model on a Xilinx VCU118 evaluation board processes 83.0 frames per second (FPS) for image applications, which is $1.8\times$ higher than the state-of-the-art design reported in the technical literature.
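As a rough illustration of the shift-accumulator idea for a sparse $P$-layer, the sketch below assumes that quantized weights are restricted to signed powers of two, so each multiplication reduces to a bit shift, and that the data flow is driven by nonzero activations. The abstract does not give these details, so the quantization levels, the fixed-point scaling, and the names `quantize_pow2`, `compress_layer`, and `adf_forward` are all hypothetical and are not the authors' implementation.

```python
import math

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round a weight to sign * 2**e (assumed scheme); tiny weights are pruned to 0."""
    if w == 0.0:
        return (0, 0)
    e = round(math.log2(abs(w)))
    if e < min_exp:
        return (0, 0)          # below the smallest level: treat as pruned
    e = min(e, max_exp)        # clamp to the largest representable level
    return (1 if w > 0 else -1, e)

def compress_layer(weights, min_exp=-7):
    """Store only nonzero weights, grouped by input index.

    weights[o][i] is the weight from input i to output o. Each kept entry is
    (output index, sign, left-shift amount), with shift = e - min_exp >= 0.
    """
    n_in = len(weights[0])
    columns = [[] for _ in range(n_in)]
    for o, row in enumerate(weights):
        for i, w in enumerate(row):
            sign, e = quantize_pow2(w, min_exp)
            if sign != 0:
                columns[i].append((o, sign, e - min_exp))
    return columns

def adf_forward(columns, activations, n_out):
    """Activation-driven accumulate: zero activations trigger no work at all.

    Activations are integers; the returned accumulators are scaled by
    2**(-min_exp) relative to the real-valued result.
    """
    acc = [0] * n_out
    for i, a in enumerate(activations):
        if a == 0:
            continue                       # skip the whole weight column
        for o, sign, shift in columns[i]:
            acc[o] += sign * (a << shift)  # shift and add replace multiply
    return acc
```

For example, with weights `[[0.5, 0.0, -0.25], [0.0, 1.0, 0.0]]` and integer activations `[4, 3, 2]`, the accumulators come out as `[192, 384]`; dividing by `2**7` recovers the real-valued outputs 1.5 and 3.0.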
