Abstract: Model compression is generally performed by using quantization, low-rank approximation or pruning, for which various algorithms have been researched in recent years. One fundamental question is: what types of compression work better for a given model? Or even better: can we improve by combining compressions in a suitable way? We formulate this generally as a problem of optimizing the loss but where the weights are constrained to equal an additive combination of separately compressed parts; and we give an algorithm to learn the corresponding parts' parameters. Experimentally with deep neural nets, we observe that 1) we can find significantly better models in the error-compression space, indicating that different compression types have complementary benefits, and 2) the best type of combination depends exquisitely on the type of neural net. For example, we can compress ResNets and AlexNet using only 1 bit per weight without error degradation at the cost of adding a few floating point weights. However, VGG nets can be better compressed by combining low-rank with a few floating point weights.

What problem does this paper attempt to address?

This paper addresses how different compression techniques can be combined to compress neural network models more effectively. Specifically, the authors propose a framework that applies several compression techniques jointly (such as quantization, low-rank approximation, and pruning) by optimizing the loss function subject to constraints on the weights. The method aims to find models with a better trade-off between error rate and compression ratio, and it allows the most suitable combination to be chosen for each type of neural network.

### Main contributions of the paper

1. **A new model compression method**: The model weights are represented as an additive combination of several separately compressed parts. This formulation includes each individual technique as a special case, and it also lets the techniques cooperate and exploit their respective advantages.
2. **Design of the optimization algorithm**: To optimize this formulation, the authors design a learning-compression (LC) algorithm that optimizes the combined compressions jointly while maintaining model performance. It alternates between optimizing the model weights and optimizing the compression parameters.
3. **Experimental verification**: Experiments on VGG16 and on ResNet models of different depths demonstrate the effectiveness of the additive combination. Compared with a single compression technique, the combined compression can significantly improve the compression ratio while maintaining or even improving the model's prediction performance.
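
The alternation described in contribution 2 can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: it assumes a least-squares loss (so the weight update has a closed form), a symmetric 1-bit codebook `{-a, +a}`, a fixed number `k` of floating-point corrections, and an arbitrary increasing penalty schedule `mus`. All function names are hypothetical.

```python
import numpy as np

def binarize(v):
    """Least-squares 1-bit fit of v with a symmetric codebook {-a, +a}."""
    a = np.mean(np.abs(v))                  # optimal scale for signs sign(v)
    return a * np.where(v >= 0, 1.0, -1.0)

def keep_top_k(v, k):
    """Least-squares k-sparse fit of v: keep the k largest-magnitude entries."""
    s = np.zeros_like(v)
    if k > 0:
        idx = np.argpartition(np.abs(v), -k)[-k:]
        s[idx] = v[idx]
    return s

def c_step(w, k, n_alternations=10):
    """Compression step: fit w ~ q + s, alternating over the two parts."""
    s = np.zeros_like(w)
    for _ in range(n_alternations):
        q = binarize(w - s)        # fix the sparse part, refit the 1-bit part
        s = keep_top_k(w - q, k)   # fix the 1-bit part, refit the corrections
    return q, s

def lc_least_squares(X, y, k, mus=(1e-2, 1e-1, 1.0, 10.0, 100.0)):
    """Toy learning-compression loop for the loss 0.5 * ||X w - y||^2."""
    d = X.shape[1]
    w = np.linalg.lstsq(X, y, rcond=None)[0]   # uncompressed reference weights
    q, s = c_step(w, k)
    for mu in mus:
        # Learning step: minimize loss + (mu / 2) * ||w - (q + s)||^2 over w.
        w = np.linalg.solve(X.T @ X + mu * np.eye(d), X.T @ y + mu * (q + s))
        # Compression step: refit the additive combination to the new weights.
        q, s = c_step(w, k)
    return q + s, q, s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 40))
    y = X @ rng.normal(size=40)
    w_c, q, s = lc_least_squares(X, y, k=4)
    print("loss of compressed weights:", 0.5 * np.sum((X @ w_c - y) ** 2))
    print("nonzero corrections:", np.count_nonzero(s))
```

Because each sub-step of `c_step` is an exact least-squares fit with the other part held fixed, the approximation error of `q + s` never exceeds that of 1-bit quantization alone. In a real network the learning step would presumably be a few epochs of gradient-based training on the penalized loss rather than a linear solve.
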
### Technical details

- **Model compression as a constrained optimization problem**: The authors formulate model compression as a constrained optimization problem in which the model weights are restricted to be the sum of several compressed parts. For example, a weight matrix \(W\) can be represented as \(W = W_1 + W_2 + W_3\), where \(W_1\) is a low-rank matrix, \(W_2\) is a sparse matrix, and \(W_3\) is a quantized matrix.
- **Advantages of additive combination**:
  - **Including individual techniques as special cases**: When all but one of the compressed parts are zero, the additive combination reduces to a single compression technique.
  - **Complementary advantages**: Different compression techniques can complement each other. For example, pruning can be regarded as adding a small number of real-valued corrections to a quantized or low-rank weight matrix.
  - **Expanding the compression space**: The additive combination greatly expands the subspace of parameters that can be compressed losslessly.
  - **Hardware implementation**: Because the parts can be applied sequentially and their results accumulated, the additive combination can be implemented efficiently on actual hardware. For example, when computing the output of a neural network layer, memory access and computation can be reduced by first computing the low-rank part and then accumulating the sparse part.

### Experimental results

- **Experiments on the CIFAR-10 dataset**: The authors run experiments on VGG16 and on ResNet models of different depths on CIFAR-10, demonstrating the effectiveness of the additive combination on different network architectures. The results show that the method can significantly improve the compression ratio while maintaining or improving the model's prediction performance.
- **Comparison with other methods**: Compared with the single compression techniques in the literature, the proposed additive combination performs well in both compression ratio and prediction performance. For example, for the ResNet20 model, 1-bit quantization plus a 3% pruning correction achieves a compression ratio of 13.84×, with an error of only 8.26%.

### Conclusion

This paper addresses the problem of how to combine multiple compression techniques effectively to improve the compression of neural network models, and it does so by proposing an additive combination compression method. The experimental results support the effectiveness of the method and provide new ideas and tools for future model compression research.
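
The hardware-implementation point under Technical details (compute the low-rank part first, then accumulate the sparse part) can be sketched as follows. This is an illustrative sketch only: the factor names `U` and `V`, the coordinate-list storage of the sparse part (`rows`, `cols`, `vals`), and the function name are assumptions, not taken from the paper.

```python
import numpy as np

def additive_layer_forward(U, V, rows, cols, vals, x):
    """Compute (U @ V.T + S) @ x without forming the dense weight matrix.

    U is m x r and V is n x r (low-rank factors); S is a sparse m x n matrix
    given by coordinate lists rows, cols, vals; x is an input of length n.
    """
    # Low-rank part: two thin products, about r * (m + n) multiplications.
    y = U @ (V.T @ x)
    # Sparse part: one multiply-add per stored nonzero, accumulated into y.
    np.add.at(y, rows, vals * x[cols])
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r = 6, 8, 2
    U, V = rng.normal(size=(m, r)), rng.normal(size=(n, r))
    rows, cols = np.array([0, 3, 5]), np.array([1, 2, 7])
    vals = rng.normal(size=3)
    x = rng.normal(size=n)
    print(additive_layer_forward(U, V, rows, cols, vals, x))
```

`np.add.at` is used rather than `y[rows] += ...` so that several nonzeros in the same row are all accumulated. A quantized part could be added to `y` in the same cumulative fashion.
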
