Abstract: Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question of which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We provide an extensive comparison between the two techniques for compressing deep neural networks. First, we give an analytical comparison of expected quantization and pruning error for general data distributions. Then, we provide lower bounds for the per-layer pruning and quantization error in trained networks, and compare these to empirical error after optimization. Finally, we provide an extensive experimental comparison for training 8 large-scale models on 3 tasks. Our results show that in most cases quantization outperforms pruning. Only in some scenarios with a very high compression ratio might pruning be beneficial from an accuracy standpoint.

What problem does this paper attempt to address?

The paper compares two key methods of neural network compression, pruning and quantization, and attempts to answer which of them performs better. It first reviews the history and current state of both techniques, pointing out that although they are almost as old as neural networks themselves, systematic comparisons between them are still rare. To fill this gap, the authors compare the two methods from both theoretical and practical perspectives, with the goal of guiding the design of future neural network hardware. The main contributions of the paper are:

1. **Theoretical Analysis**: The authors analyze the expected errors of pruning and quantization, exploring how the two methods differ under different compression ratios for various data distributions (such as the standard normal distribution and heavy-tailed distributions).
2. **Experimental Validation**: Extensive experiments are conducted to validate the theoretical results, including:
   - Statistical analysis of the weight tensors of pre-trained models;
   - Comparison in post-training quantization (PTQ) scenarios for single-layer networks;
   - Full-model comparison in fine-tuning scenarios.
3. **Conclusion**: The paper finds that in most cases quantization outperforms pruning. At moderate compression ratios in particular, quantization usually achieves higher accuracy. Only at very high compression ratios might pruning be more advantageous in certain scenarios from an accuracy perspective.
4. **Additional Discussion**: Besides the main technical comparison, the paper briefly discusses practical hardware implementation considerations for pruning and quantization, although this is not its focus.

In summary, the core question the paper attempts to answer is which method, pruning or quantization, is superior for neural network compression, and it supports its answer with theoretical analysis and extensive experimental evidence.
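To make the idea of comparing the two methods at an equal compression ratio concrete, here is a minimal sketch on synthetic standard normal data. It is an illustration under stated assumptions, not the authors' code: it assumes a 16-bit baseline, so that a compression ratio `r` means either keeping a fraction `1/r` of the weights (magnitude pruning) or storing every weight in `16/r` bits (symmetric uniform quantization with a min-max range). The signal-to-noise ratio is used as the error measure. The function names `prune_by_magnitude`, `quantize_uniform`, and `snr_db` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def prune_by_magnitude(w, keep_fraction):
    """Zero all entries except the largest-magnitude `keep_fraction` of them."""
    k = int(round(keep_fraction * w.size))
    if k <= 0:
        return np.zeros_like(w)
    # The k-th largest absolute value is the pruning threshold.
    threshold = np.sort(np.abs(w))[-k]
    return np.where(np.abs(w) >= threshold, w, 0.0)

def quantize_uniform(w, bits):
    """Symmetric uniform quantization; the grid range is set by max |w| (assumption)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels
    return np.clip(np.round(w / scale), -levels, levels) * scale

def snr_db(w, w_hat):
    """Signal-to-noise ratio in dB between a tensor and its compressed version."""
    return 10.0 * np.log10(np.sum(w ** 2) / np.sum((w - w_hat) ** 2))

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)  # synthetic stand-in for a weight tensor

baseline_bits = 16  # assumed baseline precision
results = {}
for ratio in (2, 4):
    pruned = prune_by_magnitude(w, keep_fraction=1.0 / ratio)
    quantized = quantize_uniform(w, bits=baseline_bits // ratio)
    results[ratio] = (snr_db(w, pruned), snr_db(w, quantized))
    print(f"ratio {ratio}x: pruning SNR = {results[ratio][0]:.1f} dB, "
          f"quantization SNR = {results[ratio][1]:.1f} dB")
```

On this synthetic input, quantization should retain a higher SNR than pruning at both ratios, which is in line with the summary's conclusion for moderate compression. Real weight tensors are not exactly Gaussian, and this sketch ignores the storage cost of a sparsity mask, so it is only a rough illustration of the comparison.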

Pruning vs Quantization: Which is Better?

Single-shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration

Automatic Pruning for Quantized Neural Networks

Pruning and quantization for deep neural network acceleration: A survey

Quantisation and Pruning for Neural Network Compression and Regularisation

Class-Aware Pruning for Efficient Neural Networks

Differentiable Joint Pruning and Quantization for Hardware Efficiency

Training Deep Neural Networks with Joint Quantization and Pruning of Weights and Activations

OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

A Probabilistic Approach to Neural Network Pruning

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

What Do Compressed Deep Neural Networks Forget?

What is the State of Neural Network Pruning?

Fine Granularity Is Critical for Intelligent Neural Network Pruning

Pruning Ternary Quantization

An Information-Theoretic Justification for Model Pruning

Towards Optimal Compression: Joint Pruning and Quantization

Investigating the Effect of Network Pruning on Performance and Interpretability

Learning Low Resource Consumption CNN through Pruning and Quantization

A White Paper on Neural Network Quantization