Abstract: Deep convolutional neural networks (DNNs) have demonstrated phenomenal success and are widely used in many computer vision tasks. However, their enormous model size and high computational complexity prohibit wide deployment on resource-limited embedded systems, such as FPGAs and mGPUs. As the two most widely adopted model compression techniques, weight pruning and quantization compress a DNN model by introducing weight sparsity (i.e., forcing a subset of the weights to zero) and by quantizing weights into limited bit-width values, respectively. Although prior works have attempted to combine weight pruning and quantization, we still observe disharmony between the two, especially when more aggressive compression schemes (e.g., structured pruning and low bit-width quantization) are used. In this work, taking the FPGA as the test computing platform and Processing Elements (PEs) as the basic parallel computing units, we first propose a PE-wise structured pruning scheme, which introduces weight sparsity while taking the PE architecture into account. In addition, we integrate it with an optimized weight ternarization approach that quantizes weights into ternary values ({-1,0,+1}), thus converting the dominant convolution operations in a DNN from multiplication-and-accumulation (MAC) to addition-only, and compressing the original model (from a 32-bit floating-point to a 2-bit ternary representation) by at least 16 times. We then investigate and solve the coexistence issue between PE-wise structured pruning and ternarization by proposing a Weight Penalty Clipping (WPC) technique with a self-adapting threshold. Our experiments show that the fusion of the proposed techniques achieves a state-of-the-art ∼21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation for ResNet-18 on the ImageNet dataset.
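To make the two quantitative claims in the abstract concrete (addition-only arithmetic and the 16× storage reduction), the following is a minimal sketch rather than the authors' method. It assumes a simple magnitude-threshold ternarization; the function names and the `delta_scale=0.7` heuristic are illustrative assumptions and do not come from the abstract, which does not describe the optimized ternarization or the WPC technique in detail.

```python
import numpy as np


def ternarize(weights, delta_scale=0.7):
    """Map float weights to {-1, 0, +1} using a magnitude threshold.

    delta_scale is an assumed heuristic, not a value from the abstract.
    """
    delta = delta_scale * np.mean(np.abs(weights))
    ternary = np.zeros(weights.shape, dtype=np.int8)
    ternary[weights > delta] = 1
    ternary[weights < -delta] = -1
    return ternary


def ternary_dot(ternary, activations):
    """Dot product with ternary weights using only additions/subtractions.

    Activations paired with +1 are added, those paired with -1 are
    subtracted, and those paired with 0 are skipped entirely.
    """
    return activations[ternary == 1].sum() - activations[ternary == -1].sum()


def compression_ratio(float_bits=32, ternary_bits=2):
    """Storage ratio when each weight shrinks from float_bits to ternary_bits."""
    return float_bits / ternary_bits


# Small demonstration on random data.
rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
x = rng.normal(size=64).astype(np.float32)

t = ternarize(w)
addition_only = ternary_dot(t, x)
reference = np.dot(t.astype(np.float32), x)  # conventional MAC for comparison
ratio = compression_ratio()
```

Here `addition_only` matches the conventional multiply-accumulate result `reference` up to floating-point rounding, and `ratio` evaluates to 16, which corresponds to the "at least 16 times" figure: the ratio from bit width alone, before any additional savings from pruned weights.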

Using Distillation to Improve Network Performance after Pruning and Quantization

Class-Aware Pruning for Efficient Neural Networks

Loss Constrains Added Squeeze and Excitation Blocks for Pruning Deep Neural Networks

A Novel Deep Learning Model Compression Algorithm

A Model Compression Method Using Significant Data and Knowledge Distillation

Regularized Training Framework for Combining Pruning and Quantization to Compress Neural Networks

Pruning and quantization for deep neural network acceleration: A survey

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Few Sample Knowledge Distillation for Efficient Network Compression

Learning Low Resource Consumption CNN through Pruning and Quantization

Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing

CLIP-Q: Deep Network Compression Learning by In-parallel Pruning-Quantization

Pruning-and-distillation: One-stage Joint Compression Framework for CNNs Via Clustering

OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

An Efficient Method for Model Pruning Using Knowledge Distillation with Few Samples

Quantisation and Pruning for Neural Network Compression and Regularisation

Pruning at a Glance: Global Neural Pruning for Model Compression

Progressive DNN Compression: A Key to Achieve Ultra-High Weight Pruning and Quantization Rates using ADMM

Harmonious Coexistence of Structured Weight Pruning and Ternarization for Deep Neural Networks

Deep Neural Network Compression Method Based on Product Quantization