Minimize Quantization Output Error with Bias Compensation

Cheng Gong, Haoshuai Zheng, Mengting Hu, Zheng Lin, Deng-Ping Fan, Yuzhi Zhang, Tao Li
2024-04-02
Abstract: Quantization is a promising method that reduces the memory usage and computational intensity of Deep Neural Networks (DNNs), but it often leads to significant output errors that hinder model deployment. In this paper, we propose Bias Compensation (BC) to minimize the output error, thus realizing ultra-low-precision quantization without model fine-tuning. Instead of optimizing the non-convex quantization process as most previous methods do, BC bypasses that step and directly minimizes the quantization output error by identifying a bias vector for compensation. We establish that minimizing the output error through BC is a convex problem and provide an efficient strategy to obtain optimal solutions with minimal output error, without the need for training or fine-tuning. We conduct extensive experiments on Vision Transformer models and Large Language Models, and the results show that our method notably reduces quantization output error, thereby permitting ultra-low-precision post-training quantization and enhancing the task performance of models. In particular, BC improves the accuracy of ViT-B with 4-bit PTQ4ViT by 36.89% on the ImageNet-1k task, and decreases the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on WikiText2. The code is in
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the significant output errors that arise when deep neural networks (DNNs) are quantized. Quantization is an effective way to reduce a model's memory usage and computational intensity, but it usually introduces output errors large enough to hinder deployment. The paper proposes Bias Compensation (BC), which minimizes the output error by adding a bias vector to the output of each quantized layer, enabling ultra-low-precision quantization without fine-tuning the model.

### Main contributions

1. **Proposing the BC method**: BC takes a new perspective and directly optimizes the output error by adding an optimizable bias vector to the output of each quantized layer, rather than optimizing the non-convex quantization process as previous methods do.
2. **Proving the convexity of BC optimization**: The paper proves that identifying the optimal bias vector for each quantized layer is a convex problem, so the solution with minimum output error can be obtained without fine-tuning.
3. **Experimental verification**: Extensive experiments were carried out on Vision Transformer models and large language models. The results show that BC significantly reduces the quantization output error and improves task performance. In particular, BC improves the accuracy of ViT-B* with 4-bit PTQ4ViT by 36.89% on the ImageNet-1k task and reduces the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on the WikiText2 task.

### Method overview

1. **Quantization and output error**:
   - Quantization replaces high-precision floating-point weights with low-precision weights to reduce memory footprint and computational load.
   - Quantization inevitably introduces numerical errors, and these errors propagate to the output of the quantized layer.
2. **Bias compensation**:
   - BC compensates for quantization errors by adding a bias vector to the output of each quantized layer.
   - Optimizing the bias vector is a convex problem, so the optimal solution can be obtained analytically.
   - Adding the bias vector does not change the quantization or computation process, and its computational overhead is very small.

### Experimental results

- **ViT model**: BC significantly improves the accuracy of ViT models under different bit-width quantization settings. For example, under 4-bit quantization, BC improves the accuracy of ViT-B* from 31.40% to 68.29%.
- **Large language model**: BC significantly reduces the perplexity of large language models. For example, under 3-bit GPTQ, BC reduces the perplexity of OPT-350M on WikiText2 from 54.68 to 45.19.

### Conclusion

Bias Compensation (BC) reduces the quantization output error and improves task performance by adding a bias vector to the output of each quantized layer. The method is supported by a convexity proof and delivers significant improvements in practice, especially in the ultra-low-precision quantization scenario.
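
The bias-compensation step from the method overview can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a squared-error objective over a calibration set, under which the bias b that minimizes the mean of ||y - (ŷ + b)||² is simply the per-channel mean of the residual y - ŷ. The helper names `quantize_uniform` and `bias_compensation` are hypothetical, and the round-to-nearest quantizer is only a stand-in for methods such as PTQ4ViT or GPTQ.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric round-to-nearest quantizer (illustrative stand-in, an assumption)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def bias_compensation(y_fp, y_q):
    """Bias minimizing the mean squared output error: the per-channel mean residual.

    The objective is quadratic in the bias, hence convex, so setting its
    gradient to zero gives this closed form (assumed squared-error objective).
    """
    return (y_fp - y_q).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, size=(256, 64))   # calibration inputs, shape (samples, in_features)
w = rng.normal(size=(64, 32))             # full-precision weights
w_q = quantize_uniform(w, bits=3)         # low-precision weights

y_fp = x @ w                              # full-precision layer output
y_q = x @ w_q                             # quantized layer output
b = bias_compensation(y_fp, y_q)          # one bias value per output channel

err_before = np.mean((y_fp - y_q) ** 2)
err_after = np.mean((y_fp - (y_q + b)) ** 2)
print(f"output MSE before BC: {err_before:.6f}, after BC: {err_after:.6f}")
```

Because b = 0 is one of the candidates the minimization considers, the compensated error can never exceed the uncompensated one on the calibration data. At inference time the only extra work is one vector addition per layer, which matches the small overhead described above.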