Abstract: Quantization is a promising method that reduces the memory usage and computational intensity of Deep Neural Networks (DNNs), but it often leads to significant output error that hinders model deployment. In this paper, we propose Bias Compensation (BC) to minimize the output error, thus realizing ultra-low-precision quantization without model fine-tuning. Instead of optimizing the non-convex quantization process as in most previous methods, the proposed BC bypasses that step and directly minimizes the quantization output error by identifying a bias vector for compensation. We establish that the minimization of output error through BC is a convex problem and provide an efficient strategy to obtain optimal solutions associated with minimal output error, without the need for training or fine-tuning. We conduct extensive experiments on Vision Transformer models and Large Language Models, and the results show that our method notably reduces quantization output error, thereby permitting ultra-low-precision post-training quantization and enhancing the task performance of models. In particular, BC improves the accuracy of ViT-B with 4-bit PTQ4ViT by 36.89% on the ImageNet-1k task, and decreases the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on WikiText2. The code is in

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper aims to solve the significant output error that arises when deep neural networks (DNNs) are quantized. Quantization is an effective way to reduce model memory usage and computational intensity, but it usually introduces significant output errors that hinder model deployment. The paper proposes a method named Bias Compensation (BC), which minimizes the output error by adding a bias vector to the output of each quantized layer, enabling ultra-low-precision quantization without fine-tuning the model.

### Main contributions

1. **Proposing the BC method**: From a new perspective, BC directly optimizes the output error by adding an optimizable bias vector to the output of the quantized layer, rather than optimizing the non-convex quantization process as in previous methods.
2. **Proving the convexity of BC optimization**: The paper proves that identifying the optimal bias vector for each quantized layer is a convex problem, so the optimal solution, and with it the minimum output error, can be obtained without fine-tuning.
3. **Experimental verification**: Extensive experiments were carried out on Vision Transformer models and large language models. The results show that BC significantly reduces the quantization output error and improves the task performance of the model. In particular, BC improves the 4-bit PTQ4ViT accuracy of ViT-B* by 36.89% on the ImageNet-1k task and reduces the 3-bit GPTQ perplexity of OPT-350M by 5.97 on the WikiText2 task.

### Method overview

1. **Quantization and output error**:
   - Quantization replaces high-precision floating-point weights with low-precision weights to reduce memory footprint and computational load.
   - Numerical errors are inevitably generated during the quantization process, and these errors affect the output of the quantized layer.
2. **Bias compensation**:
   - BC compensates for quantization errors by adding a bias vector to the output of each quantized layer.
   - Optimizing the bias vector is a convex problem, and the optimal solution can be obtained analytically.
   - Adding the bias vector does not affect the quantization and calculation processes, and the computational overhead is extremely small.

### Experimental results

- **ViT model**: BC significantly improves the accuracy of the ViT model under different bit-width quantization settings. For example, under 4-bit quantization, BC improves the accuracy of ViT-B* from 31.40% to 68.29%.
- **Large language model**: BC significantly reduces the perplexity of large language models. For example, under 3-bit GPTQ, BC reduces the perplexity of OPT-350M on WikiText2 from 54.68 to 45.19.

### Conclusion

The Bias Compensation (BC) method proposed in this paper effectively reduces the quantization output error and improves the task performance of the model by adding a bias vector to the output of the quantized layer. BC is not only justified in theory but also achieves significant improvements in practice, especially in the ultra-low-precision quantization scenario.
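The bias compensation step described in the method overview can be illustrated with a small NumPy sketch. It is not the paper's implementation. It assumes that the output error is measured as the mean squared difference between the full-precision and quantized outputs of a single linear layer over a calibration batch; under that assumption the convex objective has a closed-form minimizer, namely the per-channel mean of the residual. The quantizer `quantize_uniform`, the helper `bias_compensation`, and the random calibration data are hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def quantize_uniform(w, bits):
    """Symmetric per-tensor round-to-nearest quantization.

    Illustrative stand-in only; not the quantizer used in the paper.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def bias_compensation(y_fp, y_q):
    """Return the bias b minimizing mean ||y_fp - (y_q + b)||^2.

    Assumption: squared-error objective over calibration samples.
    The objective is convex in b, and setting its gradient to zero
    gives the mean residual per output channel.
    """
    return (y_fp - y_q).mean(axis=0)


rng = np.random.default_rng(0)
# Hypothetical calibration inputs; the nonzero mean gives the
# quantization error a systematic component that a bias can absorb.
x = rng.normal(loc=0.5, size=(256, 64))
w = rng.normal(size=(32, 64))          # full-precision weights
w_q = quantize_uniform(w, bits=3)      # low-precision weights

y_fp = x @ w.T                         # full-precision layer output
y_q = x @ w_q.T                        # quantized layer output
b = bias_compensation(y_fp, y_q)       # one bias entry per output channel

err_before = np.mean((y_fp - y_q) ** 2)
err_after = np.mean((y_fp - (y_q + b)) ** 2)
print(f"output MSE before BC: {err_before:.6f}, after BC: {err_after:.6f}")
```

Because `b = 0` is always a feasible choice, the compensated error in this sketch can never exceed the uncompensated one. The bias is added after the quantized matrix product, so the quantized weights themselves are left unchanged.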

Minimize Quantization Output Error with Bias Compensation
