Abstract: Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, and no prior work has investigated the cause of this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand this fundamental limitation and boost 2-bit LLM accuracy. Based on a rank analysis revealing the rank-insensitive nature of the model-wise activation discrepancy loss, RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ's consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods, enabling adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance.

### What problem does this paper attempt to solve?

This paper addresses the accuracy degradation of large language models (LLMs) under low-precision quantization, especially 2-bit quantization. Although low-rank adaptation (LoRA) and LoRA-based quantization error compensation (LQEC) perform well at 4 bits and above, their effectiveness drops significantly at 2 bits. The paper points out that existing methods need a higher rank to compensate for 2-bit quantization errors effectively, which contradicts the low-rank premise of LoRA.

### Main contributions of the paper

1. **Propose the RILQ method**:
   - RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) is a new quantization error compensation method that aims to overcome the high-rank requirement of 2-bit quantization by using a model-level loss function (Model-Loss).
   - RILQ achieves more effective quantization error compensation by adjusting adapters globally and balancing compensation across layers.
2. **Analyze the characteristics of quantization errors**:
   - The paper shows experimentally that the errors introduced by 2-bit quantization are essentially high-rank, which existing SVD-based low-rank adaptation techniques struggle to handle.
   - It proposes a rank-sensitivity analysis, revealing the impact of the quantization error range on the performance of LQEC.
3. **Experimental verification**:
   - The effectiveness of RILQ is evaluated on multiple benchmark datasets, including commonsense question-answering tasks (such as WinoGrande, PIQA, and HellaSwag) and an arithmetic reasoning task (GSM8K).
   - The experimental results show that RILQ significantly improves model accuracy under 2-bit quantization and also performs well in task-specific fine-tuning.
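The high-rank observation in contribution 2 can be illustrated with a small NumPy sketch. This is not the paper's experiment: the random Gaussian matrix is only a stand-in for an LLM weight, `quantize_minmax` is a plain per-tensor round-to-nearest quantizer assumed here for brevity, and all function names are hypothetical. The sketch quantizes the matrix, approximates the error `W - Q` with a truncated SVD (the usual SVD-based LQEC initialization), and reports how much error remains at each rank.

```python
import numpy as np

def quantize_minmax(W, bits):
    """Per-tensor uniform min-max quantize-dequantize (illustrative assumption)."""
    levels = 2 ** bits - 1
    s = (W.max() - W.min()) / levels          # step size
    z = np.round(W.min() / s)                 # integer offset of the grid
    q = np.clip(np.round(W / s) - z, 0, levels)
    return s * (q + z)

def svd_compensate(E, rank):
    """Best rank-`rank` approximation of E as factors L1, L2 with E ~= L1 @ L2.T."""
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]               # scale the kept left vectors
    L2 = Vt[:rank].T
    return L1, L2

def residual_ratio(W, bits, rank):
    """Norm of the error left after rank-`rank` compensation, relative to ||W||."""
    E = W - quantize_minmax(W, bits)
    L1, L2 = svd_compensate(E, rank)
    return np.linalg.norm(E - L1 @ L2.T) / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))           # stand-in weight, not a real LLM layer

for bits in (4, 2):
    ratios = [residual_ratio(W, bits, r) for r in (0, 16, 64, 256)]
    print(f"{bits}-bit:", [round(float(x), 4) for x in ratios])
```

On such a matrix the 2-bit error is much larger than the 4-bit error, and a small rank removes only a small part of it, which is the kind of behavior that motivates looking beyond per-weight SVD compensation. Real LLM weights have more structure than this toy input, so the numbers are only indicative.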
### Formula representation

The formulas involved in the paper are as follows:

- **Quantization weight formula**:
  \[ Q_b = s\cdot\left(\text{clamp}\left(\left\lfloor\frac{W}{s}\right\rfloor - z,\,0,\,2^b - 1\right)+z\right) \]
  where
  \[ s=\frac{\gamma\max(W)-\beta\min(W)}{2^b - 1},\quad z = \left\lfloor\frac{\beta\min(W)}{s}\right\rfloor \]
  Here \(b\) is the bit width, \(s\) is the scale, \(z\) is the zero point, and \(\gamma\) and \(\beta\) are clipping factors applied to the maximum and minimum of \(W\).
- **LoRA forward operation**:
  \[ Y = X(W + L_1L_2^T) \]
- **Optimization objective**:
  \[ \arg\min_{L_1,L_2}\|Y^N - Y_q^N\|_F \]
  where \(Y^N\) is the full-precision activation output and \(Y_q^N\) is the quantized activation output.

Through these improvements, RILQ can significantly improve the accuracy of LLMs under 2-bit quantization while maintaining computational efficiency.
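As a rough illustration of how the three formulas fit together, the following NumPy sketch evaluates a model-level discrepancy on a toy stack of layers. Several things here are assumptions rather than the paper's setup: the dequantized weight is read as `s * (q + z)`, the LoRA forward pass uses the quantized weight `Q` in place of `W`, each "layer" is a single linear map followed by `tanh`, and the names `quantize`, `lora_forward`, and `model_loss` are hypothetical. No training is performed; the sketch only shows what the objective measures.

```python
import numpy as np

def quantize(W, bits, gamma=1.0, beta=1.0):
    """Uniform quantize-dequantize following the formula above (floor-based)."""
    levels = 2 ** bits - 1
    s = (gamma * W.max() - beta * W.min()) / levels
    z = np.floor(beta * W.min() / s)
    q = np.clip(np.floor(W / s) - z, 0, levels)
    return s * (q + z)                        # assumed dequantization: s * (q + z)

def lora_forward(X, Q, L1, L2):
    """Y = X (Q + L1 L2^T), with the quantized weight Q as the frozen base."""
    return X @ (Q + L1 @ L2.T)

def model_loss(X, weights, q_weights, adapters):
    """Frobenius norm between final outputs of the full-precision and
    quantized-plus-adapter stacks (toy tanh layers, an assumption)."""
    Y, Yq = X, X
    for W, Q, (L1, L2) in zip(weights, q_weights, adapters):
        Y = np.tanh(Y @ W)
        Yq = np.tanh(lora_forward(Yq, Q, L1, L2))
    return np.linalg.norm(Y - Yq)

rng = np.random.default_rng(0)
d, n_layers, rank = 32, 3, 4
X = rng.standard_normal((8, d))
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
q_weights = [quantize(W, bits=2) for W in weights]

# Zero adapters: the loss is the raw discrepancy caused by 2-bit quantization.
zero_adapters = [(np.zeros((d, rank)), np.zeros((d, rank))) for _ in weights]
loss_zero = model_loss(X, weights, q_weights, zero_adapters)

# Full-rank "adapters" that restore W exactly: the loss drops to about zero.
exact_adapters = [(W - Q, np.eye(d)) for W, Q in zip(weights, q_weights)]
loss_exact = model_loss(X, weights, q_weights, exact_adapters)

print("loss with zero adapters:", loss_zero)
print("loss with exact adapters:", loss_exact)
```

Because the loss is taken only at the final output, the adapters in different layers are free to compensate for one another instead of each having to reconstruct its own weight error, which is the cooperative adjustment the summary describes. In practice `L1` and `L2` would be low-rank and optimized with a gradient-based method rather than set by hand.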

RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

LQER: Low-Rank Quantization Error Reconstruction for LLMs

Low-Rank Quantization-Aware Training for LLMs

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization

Accurate LoRA-Finetuning Quantization of LLMs via Information Retention

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance

QuAILoRA: Quantization-Aware Initialization for LoRA

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

Low-Rank Correction for Quantized LLMs

INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression

QuIP: 2-Bit Quantization of Large Language Models With Guarantees