Abstract: Convolutional neural networks (CNNs) have been widely used in image classification and recognition due to their effectiveness; however, CNNs use a large volume of weight data that is difficult to store in the on-chip memory of embedded designs. Pruning can compress a CNN model with a small accuracy loss; however, a pruned CNN model runs more slowly when implemented on a parallel architecture. In this paper, a hardware-oriented CNN compression strategy is proposed: a deep neural network (DNN) model is divided into "no-pruning layers ($NP$-layers)" and "pruning layers ($P$-layers)". An $NP$-layer has a regular weight distribution suited to parallel computing and high performance. A $P$-layer is irregular due to pruning, but it yields a high compression ratio. Uniform and incremental quantization schemes are used to achieve a tradeoff between compression ratio and processing efficiency at a small loss in accuracy. A distributed convolutional architecture with several parallel finite impulse response (FIR) filters is further proposed for the regular model in the $NP$-layers. A shift-accumulator based processing element with an activation-driven data flow (ADF) is proposed for the irregular sparse model in the $P$-layers. Based on the proposed compression strategy and hardware architecture, a hardware/algorithm co-optimization (HACO) approach is proposed for implementing an $NP$-$P$ hybrid compressed CNN model on FPGAs. For a hardware accelerator on a single FPGA chip without the use of off-chip memory, a $27.5\times$ compression ratio is achieved with a 0.44% top-5 accuracy loss for VGG-16. The implementation of the compressed VGG-16 model on a Xilinx VCU118 evaluation board processes 83.0 frames per second (FPS) for image applications, which is $1.8\times$ faster than the state-of-the-art design found in the technical literature.
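
To make the split between dense and pruned layers concrete, the following minimal sketch estimates a compression ratio for a toy hybrid model. It is an illustration under stated assumptions, not the paper's method: the function names, the bit widths (`np_bits`, `p_bits`, `index_bits`), the 32-bit baseline, the per-weight index storage for sparse layers, and the magnitude-based pruning rule are all hypothetical choices made for this example.

```python
def uniform_quantize(weights, bits):
    """Symmetric uniform quantization to signed integer codes (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels
    codes = [round(w / scale) for w in weights]
    return codes, scale


def magnitude_prune(weights, keep_ratio):
    """Keep the largest-magnitude weights as (index, value) pairs (assumed rule)."""
    keep = max(1, int(len(weights) * keep_ratio))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i, weights[i]) for i in sorted(order[:keep])]


def compression_ratio(layers, np_bits=8, p_bits=4, index_bits=4, base_bits=32):
    """Ratio of uncompressed to compressed storage for a hybrid model.

    Each layer is a dict with keys "kind" ("NP" or "P"), "weights", and,
    for "P" layers, "keep_ratio". All bit widths are illustrative.
    """
    original = 0
    compressed = 0
    for layer in layers:
        weights = layer["weights"]
        original += len(weights) * base_bits
        if layer["kind"] == "NP":
            # Dense and regular: every weight is stored at np_bits.
            compressed += len(weights) * np_bits
        else:
            # Sparse and irregular: each surviving weight also stores an index.
            kept = magnitude_prune(weights, layer["keep_ratio"])
            compressed += len(kept) * (p_bits + index_bits)
    return original / compressed


toy_model = [
    {"kind": "NP", "weights": [0.5, -1.0, 0.25, 0.0]},
    {"kind": "P", "weights": [0.1, -0.9, 0.05, 0.0, 0.7, -0.2, 0.01, 0.3],
     "keep_ratio": 0.25},
]
ratio = compression_ratio(toy_model)
```

In this toy setting the dense layer costs 4 × 8 bits and the pruned layer costs 2 × (4 + 4) bits, against 12 × 32 bits uncompressed. The index overhead in the sparse layer is what makes pruned layers irregular to process, which is the tradeoff the abstract describes.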
