Abstract: With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ), a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights, thus boosting the generalization capability of quantized LLMs. We show the superiority of LRQ over prior LLM PTQ works under (i) $8$-bit weight and per-tensor activation quantization, (ii) $4$-bit weight and $8$-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes. Our code is available at \url{https://github.com/onliwad101/FlexRound_LRQ} to inspire LLM researchers and engineers.

What problem does this paper attempt to address?

This paper addresses the accuracy degradation that large language models (LLMs) suffer when they are quantized. Existing post-training quantization (PTQ) techniques incur non-negligible accuracy loss when quantizing the weights and activations of LLMs, especially on Massive Multitask Language Understanding (MMLU). To tackle this, the authors propose Low-Rank Quantization (LRQ), a post-training weight quantization method that improves the generalization of quantized LLMs by reconstructing the outputs of intermediate Transformer blocks using low-rank weight-scaling matrices.

### Background and Motivation

1. **Background**:
   - LLMs such as ChatGPT and GPT-4 have demonstrated strong performance in commonsense reasoning, mathematical problem-solving, and programming, but they require substantial memory and compute when run in FP16.
   - Quantization is widely used to compress and accelerate LLMs, reducing inference cost and improving throughput. Methods fall mainly into two groups: weight-only quantization and weight-activation quantization.
   - Weight-activation quantization accelerates compute-intensive operations such as matrix-matrix multiplication by quantizing both weights and activations to low-bit integers (e.g., 8-bit) and running INT8 GEMM kernels. However, it can cause significant accuracy loss.
2. **Motivation**:
   - Existing methods such as SmoothQuant and FlexRound perform well on some tasks but still degrade on harder benchmarks such as MMLU.
   - The authors attribute this to learning an independent scaling factor for every weight, which overfits easily when only a small number of calibration samples is available.
   - LRQ addresses this by replacing the full weight-scaling matrix with low-rank weight-scaling matrices, which cuts the number of learnable parameters and thereby improves the generalization of quantized LLMs.

### Method Overview

1. **Low-Rank Quantization (LRQ)**:
   - LRQ reduces the number of learnable parameters by factorizing the full weight-scaling matrix into low-rank matrices. Specifically, for a weight matrix $ W $, the weight-scaling matrix $ S_2 $, which has one entry per weight, is replaced by the product of two smaller matrices $ L_2 $ and $ U_2 $ whose shared inner dimension (the rank) is well below the dimensions of $ W $.
   - Because the product $ L_2 U_2 $ still has one entry per weight, each weight keeps an individual scale even though far fewer parameters are learned. This parameter sharing is what improves the generalization of the quantized model.
2. **Experimental Results**:
   - The authors run experiments on several LLMs (e.g., Llama). The results show that LRQ achieves accuracy comparable to the FP16 baseline on commonsense reasoning tasks and on the MMLU benchmark.
   - Compared with existing quantization methods, LRQ performs well with minimal accuracy loss under several schemes: 8-bit weight and per-tensor activation quantization, 4-bit weight and 8-bit per-token activation quantization, and low-bit weight-only quantization.

### Main Contributions

1. Proposing a post-training weight quantization method, Low-Rank Quantization (LRQ), which improves the generalization of quantized LLMs through low-rank weight-scaling matrices.
2. Empirically analyzing the importance of reducing the number of learnable parameters and exploring how low-rank matrices affect the generalization of quantized LLMs.
3. Validating the effectiveness of LRQ under various quantization schemes and demonstrating its advantage across different tasks.

Through these contributions, LRQ offers an effective way to preserve the accuracy of large language models after quantization, supporting efficient, low-resource model deployment.
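The low-rank decomposition described under Method Overview can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation. The symbols $S_2$, $L_2$, and $U_2$ come from the summary above; everything else is an assumption made for illustration: the shared step size `s1`, the use of `exp` to keep every scale positive, the zero initialization of `L2`, and the hypothetical helper names `matmul`, `low_rank_scaled_quantize`, and `num_scale_params`. The sketch only shows how a per-weight scale can be built from two small factors and how many parameters that saves; it does not include the block-output reconstruction used to learn the factors.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def low_rank_scaled_quantize(W, L2, U2, s1, bits=8):
    """Quantize W using one scale per weight, built from low-rank factors.

    Assumption (illustrative only): the divisor for weight (i, j) is
    s1 * exp(S2[i][j]) with S2 = L2 @ U2, so each scale stays positive.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    S2 = matmul(L2, U2)  # same shape as W, but only rank * (c_out + c_in) free parameters
    W_hat = []
    for w_row, s_row in zip(W, S2):
        q_row = []
        for w, s in zip(w_row, s_row):
            q = round(w / (s1 * math.exp(s)))  # Python rounds exact ties to even
            q = max(qmin, min(qmax, q))        # clamp to the signed integer range
            q_row.append(s1 * q)               # map back to the real-valued grid
        W_hat.append(q_row)
    return W_hat


def num_scale_params(c_out, c_in, rank=None):
    """Count scale parameters: full matrix if rank is None, else two low-rank factors."""
    if rank is None:
        return c_out * c_in
    return rank * (c_out + c_in)


random.seed(0)
c_out, c_in, rank = 4, 6, 2
W = [[random.uniform(-1, 1) for _ in range(c_in)] for _ in range(c_out)]
s1 = max(abs(w) for row in W for w in row) / 127  # assumed per-tensor step size

# Hypothetical starting point: L2 = 0 makes S2 = 0, so the scheme reduces to
# plain round-to-nearest before any learning takes place.
L2 = [[0.0] * rank for _ in range(c_out)]
U2 = [[random.gauss(0, 1) for _ in range(c_in)] for _ in range(rank)]
W_hat = low_rank_scaled_quantize(W, L2, U2, s1)

print(num_scale_params(4096, 4096))           # 16777216 scales for a full matrix
print(num_scale_params(4096, 4096, rank=64))  # 524288 scales for rank-64 factors
```

For a 4096 x 4096 weight matrix, the rank-64 factorization in this sketch stores 32 times fewer scale parameters than a full scaling matrix, while `matmul(L2, U2)` still yields a distinct entry for every weight. The layer size and rank here are illustrative and are not taken from the paper.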

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
