Abstract: Convolutional neural networks (CNNs) are widely used in image classification and recognition because of their effectiveness; however, CNNs require a large volume of weight data that is difficult to store in the on-chip memory of embedded designs. Pruning can compress a CNN model with a small accuracy loss; however, a pruned CNN model runs more slowly when implemented on a parallel architecture. In this paper, a hardware-oriented CNN compression strategy is proposed in which a deep neural network (DNN) model is divided into "no-pruning layers ($NP$-layers)" and "pruning layers ($P$-layers)". An $NP$-layer has a regular weight distribution, which suits parallel computing and high performance. A $P$-layer is irregular due to pruning, but it yields a high compression ratio. Uniform and incremental quantization schemes are used to achieve a tradeoff between compression ratio and processing efficiency at a small loss in accuracy. A distributed convolutional architecture with several parallel finite impulse response (FIR) filters is further proposed for the regular model in the $NP$-layers. A shift-accumulator based processing element with an activation-driven data flow (ADF) is proposed for the irregular sparse model in the $P$-layers. Based on the proposed compression strategy and hardware architecture, a hardware/algorithm co-optimization (HACO) approach is proposed for implementing an $NP$-$P$ hybrid compressed CNN model on FPGAs. For a hardware accelerator on a single FPGA chip without the use of off-chip memory, a $27.5\times$ compression ratio is achieved with 0.44% top-5 accuracy loss for VGG-16. The implementation of the compressed VGG-16 model on a Xilinx VCU118 evaluation board processes 83.0 frames per second (FPS) for image applications; this is $1.8\times$ higher than the state-of-the-art design found in the technical literature.
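
The shift-accumulator idea mentioned for the $P$-layers can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that weights are quantized to signed powers of two, so that each multiplication becomes a bit shift, and that pruned weights and zero activations are simply skipped. The names `quantize_pow2`, `shift_accumulate`, and `FRAC_BITS`, the exponent range, and the log-domain rounding rule are all hypothetical choices made for illustration.

```python
import math

FRAC_BITS = 7  # assumed: accumulator keeps 7 fractional bits (weights down to 2**-7)


def quantize_pow2(w, min_exp=-FRAC_BITS, max_exp=0):
    """Map a weight to a signed power of two, rounding in the log2 domain.

    Returns (sign, exponent), or None for a pruned (zero) weight.
    This rounding rule is an illustrative assumption.
    """
    if w == 0.0:
        return None
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the representable range
    return sign, exp


def shift_accumulate(activations, weights_q):
    """Dot product of integer activations with power-of-two weights.

    Each product a * 2**exp is computed as a left shift in a fixed-point
    accumulator, so no multiplier is needed. Pruned weights and zero
    activations contribute nothing and are skipped.
    """
    acc = 0
    for a, wq in zip(activations, weights_q):
        if a == 0 or wq is None:
            continue
        sign, exp = wq
        acc += sign * (a << (FRAC_BITS + exp))
    return acc  # fixed-point value with FRAC_BITS fractional bits


weights = [0.5, 0.0, -0.25, 0.3]       # 0.0 stands for a pruned weight
activations = [4, 9, 0, 8]             # integer activations
weights_q = [quantize_pow2(w) for w in weights]
result = shift_accumulate(activations, weights_q) / (1 << FRAC_BITS)
```

Here 0.3 is quantized to 0.25, the second weight is pruned, and the third activation is zero, so only two shift-and-add operations are performed and `result` equals 4 * 0.5 + 8 * 0.25.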

High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization
