Towards Optimal Compression: Joint Pruning and Quantization

Ben Zandonati, Glenn Bucagu, Adrian Alan Pol, Maurizio Pierini, Olya Sirkin, Tal Kopetz

DOI: https://doi.org/10.48550/arXiv.2302.07612

2023-06-11

Abstract: Model compression is instrumental in optimizing deep neural network inference on resource-constrained hardware. The prevailing methods for network compression, namely quantization and pruning, have been shown to enhance efficiency at the cost of performance. Determining the most effective quantization and pruning strategies for individual layers and parameters remains a challenging problem, often requiring computationally expensive and ad hoc numerical optimization techniques. This paper introduces FITCompress, a novel method integrating layer-wise mixed-precision quantization and unstructured pruning using a unified heuristic approach. By leveraging the Fisher Information Metric and path planning through compression space, FITCompress optimally selects a combination of pruning mask and mixed-precision quantization configuration for a given pre-trained model and compression constraint. Experiments on computer vision and natural language processing benchmarks demonstrate that our proposed approach achieves a superior compression-performance trade-off compared to existing state-of-the-art methods. FITCompress stands out for its principled derivation, making it versatile across tasks and network architectures, and represents a step towards achieving optimal compression for neural networks.

Machine Learning

What problem does this paper attempt to address?

This paper addresses the problem of making deep neural network (DNN) inference efficient on resource-constrained hardware. Specifically, it asks how pruning and quantization can be combined to compress a model while preserving its performance. Existing pruning and quantization methods improve efficiency, but usually at the cost of performance. Choosing the best quantization and pruning strategy for each layer and parameter remains difficult and often requires computationally expensive, ad hoc numerical optimization.

The paper introduces FITCompress, which combines layer-wise mixed-precision quantization with unstructured pruning under a single heuristic. Given a pre-trained model and a compression constraint, it selects a pruning mask and a mixed-precision quantization configuration. It does so by using the Fisher Information Metric (FIM) and path planning through compression space, with the aim of minimizing the performance degradation of the pre-trained model while meeting the constraint. Overall, the paper aims to provide a general and efficient compression method that applies across tasks and network architectures and improves the trade-off between performance and compression, particularly at high compression levels.
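To make the idea of a Fisher-weighted search through compression space concrete, the following is a minimal sketch and not the authors' implementation. It assumes a diagonal Fisher approximation, per-layer magnitude pruning, symmetric uniform quantization, and a plain greedy search standing in for the paper's path planning. All function names and parameters (`fisher_distance`, `greedy_path`, `sparsity_step`, and so on) are hypothetical, and the size model ignores the cost of storing sparse indices.

```python
import numpy as np

def prune_magnitude(w, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    out = w.copy()
    k = int(round(sparsity * w.size))
    if k > 0:
        out[np.argsort(np.abs(w))[:k]] = 0.0
    return out

def quantize_uniform(w, bits):
    """Symmetric uniform quantization (assumed scheme, bits >= 2)."""
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return w.copy()
    scale = max_abs / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def compress_layer(w, sparsity, bits):
    """Prune first, then quantize; pruned weights stay exactly zero."""
    return quantize_uniform(prune_magnitude(w, sparsity), bits)

def fisher_distance(w, w_compressed, fisher_diag):
    """Second-order proxy for the change in model output:
    0.5 * sum_i F_ii * (delta_i)^2, with a diagonal Fisher (assumption)."""
    delta = w_compressed - w
    return 0.5 * float(np.sum(fisher_diag * delta * delta))

def model_bits(layers, config):
    """Approximate size in bits: kept weights times bit width per layer."""
    return sum((1.0 - s) * w.size * b for w, (s, b) in zip(layers, config))

def greedy_path(layers, fishers, budget_bits, start_bits=8, min_bits=2,
                sparsity_step=0.1, max_sparsity=0.9):
    """Greedy walk through (sparsity, bits) space: at each step, take the
    single-layer move with the lowest total Fisher-weighted cost."""
    config = [(0.0, start_bits) for _ in layers]

    def total_cost(cfg):
        return sum(fisher_distance(w, compress_layer(w, s, b), f)
                   for w, f, (s, b) in zip(layers, fishers, cfg))

    while model_bits(layers, config) > budget_bits:
        candidates = []
        for i, (s, b) in enumerate(config):
            if b > min_bits:  # move 1: drop one bit in layer i
                candidates.append(config[:i] + [(s, b - 1)] + config[i + 1:])
            if s + sparsity_step <= max_sparsity + 1e-9:  # move 2: prune more
                s_new = round(s + sparsity_step, 10)
                candidates.append(config[:i] + [(s_new, b)] + config[i + 1:])
        if not candidates:
            raise ValueError("budget not reachable within the given limits")
        config = min(candidates, key=total_cost)
    return config, total_cost(config)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = [rng.normal(size=50), rng.normal(size=80)]   # toy weights
    fishers = [rng.random(50), rng.random(80)]            # toy Fisher diagonals
    budget = 0.25 * model_bits(layers, [(0.0, 8)] * 2)    # 4x compression target
    print(greedy_path(layers, fishers, budget))
```

In practice the Fisher diagonal would be estimated from squared gradients on training data rather than drawn at random, and a more careful search would weigh each move's cost against the bits it saves. The sketch only shows how a single sensitivity measure can score pruning and quantization moves on the same scale.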

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing

Regularized Training Framework for Combining Pruning and Quantization to Compress Neural Networks

Automated Model Compression by Jointly Applied Pruning and Quantization

Pruning and quantization for deep neural network acceleration: A survey

OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

CLIP-Q: Deep Network Compression Learning by In-parallel Pruning-Quantization

HFPQ: Deep Neural Network Compression by Hardware-Friendly Pruning-Quantization

Hybrid Network Compression Via Meta-Learning

Edge AI: Evaluation of Model Compression Techniques for Convolutional Neural Networks

Towards Hardware-Specific Automatic Compression of Neural Networks

Differentiable Joint Pruning and Quantization for Hardware Efficiency

Quantisation and Pruning for Neural Network Compression and Regularisation

Pruning at a Glance: Global Neural Pruning for Model Compression

Pruning Ternary Quantization

AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates

Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning

Neural Network Compression using Binarization and Few Full-Precision Weights

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding