Understanding the difficulty of low-precision post-training quantization of large language models

Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang
2024-10-19
Abstract: Large language models of high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization by minimizing local, layer-wise quantization errors, or through quantization-aware fine-tuning by minimizing the global loss function. In this study, we discovered that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly prominent when the numerical precision is very low. We further showed that this difficulty of post-training quantization arose from stark misalignment between optimization of the local and global objective functions. Our findings explain the limited utility of minimizing local quantization error and the importance of direct quantization-aware fine-tuning, in the regime of large models at very low precision.
Machine Learning
What problem does this paper attempt to address?
This paper examines why Post-Training Quantization (PTQ) and Quantization-Aware Fine-Tuning (QAFT) differ in effectiveness for Large Language Models (LLMs) under the same low-precision quantization conditions. Specifically, the authors found that when the numerical precision is very low, PTQ, which minimizes local, layer-wise quantization error, usually performs worse than QAFT, which minimizes the global loss function. The cause is a significant misalignment between the local and global optimization objectives.

### Main Research Contents

1. **Background Introduction**:
   - Although large language models are powerful, they are computationally expensive. Compressing model weights to a lower numerical precision can improve their efficiency.
   - There are two main quantization methods, PTQ and QAFT.
   - **PTQ**: Achieved by minimizing the local, layer-wise quantization error.
   - **QAFT**: Achieved by minimizing the global loss function, which requires back-propagation and gradient updates.
2. **Research Motivation**:
   - Theoretically, when the quantization error is small, minimizing the local loss should also minimize the global loss.
   - In practice, the authors found that when the numerical precision is very low, PTQ is far less effective than QAFT.
3. **Experimental Methods**:
   - 11 models of different scales were used, including GPT-2, OPT, and Llama 2.
   - The WikiText-2 dataset was used.
   - Quantization data types include int8, int6, int4, int3, and int2.
   - GPTQ (as the PTQ method) and QAFT were run at each precision and their performance was compared.
4. **Experimental Results**:
   - **Global NLL Loss**: QAFT outperforms GPTQ at all precisions.
   - **Local MSE Loss**: GPTQ effectively reduces the local MSE loss at all precisions, but QAFT still performs better at low precisions.
   - **Loss Landscape Analysis**: Plots of the loss landscape explain why GPTQ performs poorly at low precisions. The weight perturbation caused by quantization exceeds the basin of attraction around the pre-trained convergence point, so GPTQ cannot find the global optimum.
5. **Conclusion**:
   - Minimizing the local MSE loss does not necessarily reduce the global NLL loss, especially in low-precision quantization.
   - QAFT is more effective in low-precision quantization because it directly optimizes the global loss function.
   - The authors recommend caution when applying post-training quantization techniques based on local error minimization in practice, and offer a new perspective on the effectiveness and limitations of these methods.

### Formula Explanation

- **Global NLL Loss**:
  \[ W_{\text{QAFT}}=\arg\min_{W'}\text{NLL}(x \mid f_Q(W')) \]
- **Local MSE Loss**:
  \[ \hat{W}_l = \arg\min_{W'_l}\text{MSE}(Q(W'_l)x_l, W_l x_l)=\arg\min_{W'_l}\|Q(W'_l)x_l - W_l x_l\|^2 \]

Here \(Q\) denotes the quantizer, \(W_l\) and \(x_l\) the original weights and input of layer \(l\), \(\hat{W}_l\) the locally optimized weights of that layer, and \(f_Q(W')\) the network evaluated with quantized weights \(W'\).

Together, these results offer guidance for future quantization of large language models.
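To make the two objectives above concrete, here is a minimal NumPy sketch; it is not the paper's code. It assumes plain round-to-nearest symmetric per-tensor quantization (not GPTQ, and with no fine-tuning), a toy two-layer ReLU network with random weights, and random inputs in place of WikiText-2. The names `quantize_symmetric`, `local_mse`, and `global_nll` are hypothetical. It only shows how a layer-wise MSE and a global NLL could be measured at each of the integer precisions listed above.

```python
import numpy as np


def quantize_symmetric(w, bits):
    """Round-to-nearest symmetric per-tensor quantization (assumed scheme).

    Returns de-quantized floats so the result can be used like the original weights.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for int8, 1 for int2
    scale = np.max(np.abs(w)) / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale


def local_mse(w, w_q, x):
    """Local objective: MSE between quantized and original outputs of one layer."""
    return float(np.mean((w_q @ x - w @ x) ** 2))


def global_nll(weights, x, targets):
    """Global objective: mean negative log-likelihood of a toy two-layer network."""
    w1, w2 = weights
    hidden = np.maximum(w1 @ x, 0.0)                    # ReLU
    logits = w2 @ hidden
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return float(-np.mean(log_probs[targets, np.arange(targets.size)]))


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 256))    # toy inputs: 32 features, 256 samples
w1 = rng.standard_normal((64, 32))    # toy "pre-trained" weights (random here)
w2 = rng.standard_normal((10, 64))

# Toy labels: the full-precision network's own predictions.
targets = np.argmax(w2 @ np.maximum(w1 @ x, 0.0), axis=0)

results = {}
for bits in (8, 6, 4, 3, 2):
    w1_q = quantize_symmetric(w1, bits)
    w2_q = quantize_symmetric(w2, bits)
    results[bits] = (
        local_mse(w1, w1_q, x),               # layer-wise error of the first layer
        global_nll((w1_q, w2_q), x, targets),  # loss of the whole quantized network
    )
    print(f"int{bits}: local MSE = {results[bits][0]:.4g}, global NLL = {results[bits][1]:.4g}")
```

Printing both quantities side by side for each bit width shows how the local and global measures are computed on the same quantized weights. A toy like this does not reproduce the paper's results.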