LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee
2024-07-16
Abstract: With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ), a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights, thus boosting the generalization capability of quantized LLMs. We show the superiority of LRQ over prior LLM PTQ works under (i) 8-bit weight and per-tensor activation quantization, (ii) 4-bit weight and 8-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes. Our code is available at https://github.com/onliwad101/FlexRound_LRQ to inspire LLM researchers and engineers.
Machine Learning, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the accuracy degradation that large language models (LLMs) suffer when they are quantized. Specifically, existing post-training quantization (PTQ) techniques incur non-negligible accuracy loss when quantizing the weights and activations of LLMs, especially on Massive Multitask Language Understanding (MMLU). To tackle this, the authors propose Low-Rank Quantization (LRQ), a post-training weight quantization method that reconstructs the outputs of intermediate Transformer blocks using low-rank weight-scaling matrices, which improves the generalization ability of quantized LLMs.

### Background and Motivation

1. **Background**:
   - LLMs such as ChatGPT and GPT-4 have demonstrated outstanding performance in commonsense reasoning, mathematical problem-solving, and programming, but they require substantial memory and compute when run in FP16.
   - Quantization is widely used to compress and accelerate LLMs, reducing inference costs and improving throughput. Approaches fall into two groups: weight-only quantization and weight-activation quantization.
   - Weight-activation quantization accelerates compute-intensive operations such as matrix-matrix multiplication by quantizing both weights and activations to low-bit integers (e.g., 8-bit) and using INT8 GEMM kernels. However, it can cause significant accuracy loss.

2. **Motivation**:
   - Existing methods such as SmoothQuant and FlexRound perform well on some tasks but still degrade on harder benchmarks such as MMLU.
   - The authors attribute this to the full weight-scaling matrix, which requires learning an independent scale for every weight and can therefore easily overfit the limited calibration samples.
   - LRQ instead uses low-rank weight-scaling matrices, which reduces the number of learnable parameters and thereby improves the generalization ability of quantized LLMs.

### Method Overview

1. **Low-Rank Quantization (LRQ)**:
   - LRQ replaces the full weight-scaling matrix with a low-rank one. Specifically, for a weight matrix \( W \), the weight-scaling matrix \( S_2 \) is factorized as the product of two smaller matrices \( L_2 \) and \( U_2 \), whose shared inner dimension (the rank) is smaller than the dimensions of \( W \).
   - Because the product \( L_2 U_2 \) still has one entry per weight, every weight keeps its own scale while far fewer parameters are learned, which enhances the generalization ability of the quantized model.

2. **Experimental Results**:
   - The authors ran experiments on several LLMs (e.g., Llama). The results show that LRQ achieves performance comparable to the FP16 baseline on commonsense reasoning tasks and on the MMLU benchmark.
   - Compared to existing quantization methods, LRQ incurs minimal accuracy loss under all three evaluated schemes: 8-bit weight and per-tensor activation quantization, 4-bit weight and 8-bit per-token activation quantization, and low-bit weight-only quantization.

### Main Contributions

1. Proposing a novel post-training weight quantization method, Low-Rank Quantization (LRQ), which improves the generalization performance of quantized LLMs through low-rank weight-scaling matrices.
2. Empirically analyzing the importance of reducing the number of learnable parameters and exploring how low-rank matrices affect the generalization ability of quantized LLMs.
3. Validating the effectiveness of LRQ under various quantization schemes, demonstrating its superior performance across different tasks.

Through these contributions, LRQ provides an effective method for preserving the accuracy of large language models after quantization, facilitating efficient, low-resource model deployment.
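
The low-rank weight-scaling idea summarized above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `exp()` that keeps each divisor positive, the division-based rounding, and the 4096 x 4096 / rank-64 sizes are all assumptions chosen for illustration, and any terms the paper adds beyond the product of \( L_2 \) and \( U_2 \) are omitted. Plain Python lists keep the example dependency-free; a real implementation would use a tensor library and learn `l2` and `u2` by minimizing block-output reconstruction error.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scale_param_counts(c_out, c_in, rank):
    """Learnable scales: one per weight vs. the two low-rank factors."""
    full = c_out * c_in               # full weight-scaling matrix
    low_rank = rank * (c_out + c_in)  # L2 is (c_out x rank), U2 is (rank x c_in)
    return full, low_rank


def lowrank_fake_quantize(w, s1, l2, u2, n_bits=8):
    """Quantize-dequantize w with a per-weight divisor exp(l2 @ u2).

    ASSUMPTION: the exp() and the division-based rounding are illustrative
    choices, not necessarily the paper's exact parameterization.
    """
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    log_scale = matmul(l2, u2)  # one entry per weight, built from few parameters
    out = []
    for w_row, ls_row in zip(w, log_scale):
        out_row = []
        for weight, ls in zip(w_row, ls_row):
            q = round(weight / (s1 * math.exp(ls)))       # integer grid index
            out_row.append(s1 * max(qmin, min(qmax, q)))  # clip, then dequantize
        out.append(out_row)
    return out


# Hypothetical 4096 x 4096 layer with rank 64.
full, low_rank = scale_param_counts(4096, 4096, 64)
print(full, low_rank)  # 16777216 524288

w = [[0.26, -0.74], [1.49, 0.02]]
s1 = 0.5
# All-zero factors give exp(0) = 1, i.e. plain round-to-nearest.
plain = lowrank_fake_quantize(w, s1, [[0.0], [0.0]], [[0.0, 0.0]])
# A rank-1 scaling that doubles the divisor of w[0][0] only.
scaled = lowrank_fake_quantize(w, s1, [[math.log(2.0)], [0.0]], [[1.0, 0.0]])
print(plain)
print(scaled)
```

In this toy case the rank-1 factors change how a single weight is rounded while the others are untouched, which is the sense in which a low-rank scaling matrix still scales weights individually. With the hypothetical sizes above, the factors hold 32 times fewer learnable scales than a full matrix.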