Abstract: Large language models of high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization by minimizing local, layer-wise quantization errors, or through quantization-aware fine-tuning by minimizing the global loss function. In this study, we discovered that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly prominent when the numerical precision is very low. We further showed that this difficulty of post-training quantization arose from stark misalignment between optimization of the local and global objective functions. Our findings explain the limited utility of minimizing local quantization error and the importance of direct quantization-aware fine-tuning in the regime of large models at very low precision.

What problem does this paper attempt to address?

This paper examines why Post-Training Quantization (PTQ) and Quantization-Aware Fine-Tuning (QAFT) perform differently on Large Language Models (LLMs) under the same data constraint and at the same low precision. Specifically, the authors found that when the numerical precision is very low, PTQ, which minimizes local, layer-wise quantization error, usually performs worse than QAFT, which minimizes the global loss function. They attribute this to a significant misalignment between the local and global optimization objectives.

### Main Research Contents

1. **Background Introduction**:
   - Although large language models are powerful, they are computationally expensive. Compressing model weights to a lower numerical precision can improve their efficiency.
   - There are two main quantization methods: PTQ and QAFT.
     - **PTQ**: Achieved by minimizing the local, layer-wise quantization error.
     - **QAFT**: Achieved by minimizing the global loss function, which requires back-propagation and gradient updates.
2. **Research Motivation**:
   - Theoretically, when the quantization error is small, minimizing the local loss should also minimize the global loss.
   - In practice, the authors found that when the numerical precision is very low, PTQ is far less effective than QAFT.
3. **Experimental Methods**:
   - 11 models of different scales were used, including GPT-2, OPT, and Llama 2.
   - The WikiText-2 dataset was used.
   - Quantization data types include int8, int6, int4, int3, and int2.
   - Experiments were carried out with GPTQ (as the PTQ method) and QAFT, and their performance at different precisions was compared.
4. **Experimental Results**:
   - **Global NLL Loss**: The QAFT method outperforms the GPTQ method at all precisions.
   - **Local MSE Loss**: The GPTQ method can effectively reduce the local MSE loss at all precisions, but the QAFT method performs better at low precisions.
   - **Loss Landscape Analysis**: Plots of the loss landscape explain why GPTQ performs poorly at low precisions. The main reason is that the weight perturbation caused by quantization exceeds the attraction basin around the pre-training convergence point, so the GPTQ method cannot find the global optimum.
5. **Conclusion**:
   - Minimizing the local MSE loss does not necessarily reduce the global NLL loss, especially in low-precision quantization.
   - The QAFT method is more effective in low-precision quantization because it directly optimizes the global loss function.
   - The authors suggest using post-training quantization techniques based on local error minimization with caution in practical applications, and they provide a new perspective for understanding the effectiveness and limitations of these methods.

### Formula Explanation

- **Global NLL Loss** (the QAFT objective):
  \[ W_{\text{QAFT}}=\arg\min_{W'}\text{NLL}(x \mid f_Q(W')) \]
- **Local MSE Loss** (the layer-wise PTQ objective for layer \(l\)):
  \[ W_{\text{PTQ},l} = \arg\min_{W'_l}\text{MSE}(Q(W'_l)x_l, W_l x_l)=\arg\min_{W'_l}\|Q(W'_l)x_l - W_l x_l\|^2 \]

Here \(Q\) is the quantization function, \(W_l\) are the original weights of layer \(l\), and \(x_l\) is the input to that layer.

These results offer guidance for future quantization of large language models.
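The local MSE objective from the formula section can be illustrated with a small NumPy sketch. It is not the GPTQ algorithm used in the paper: it applies plain symmetric round-to-nearest quantization to a random toy layer and measures the local output error at the bit widths listed in the summary. The helper names `quantize_rtn` and `local_mse`, the layer shape, and the random data are all assumptions made for illustration.

```python
import numpy as np

def quantize_rtn(w, bits):
    """Symmetric per-tensor round-to-nearest quantization (illustrative helper, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1                 # largest positive integer level, e.g. 127 for int8
    scale = np.max(np.abs(w)) / qmax           # one scale shared by the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized weights, same shape as w

def local_mse(w, w_q, x):
    """Mean squared error between the quantized and original layer outputs."""
    return float(np.mean((w_q @ x - w @ x) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))    # toy layer weights (assumed, not from the paper)
X = rng.normal(size=(64, 256))   # toy calibration inputs (assumed, not from the paper)

# Local error for each precision mentioned in the summary.
errors = {bits: local_mse(W, quantize_rtn(W, bits), X) for bits in (8, 6, 4, 3, 2)}
for bits, err in errors.items():
    print(f"int{bits}: local MSE = {err:.6f}")
```

In this toy setting the local error grows as the bit width shrinks. The sketch only measures the local objective; it says nothing about the global NLL loss, which is exactly the gap the paper investigates.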

Understanding the difficulty of low-precision post-training quantization of large language models
