LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim
2024-08-27
Abstract: We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables aggressive quantization to sub-3 bits with only minor performance degradations. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) performs respectably compared to the 16-bit baseline.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the memory cost of fine-tuning large language models (LLMs). Specifically:

1. **Memory-Efficient Fine-Tuning**: The paper proposes a simple method for reducing the memory needed to adapt a pre-trained language model. An iterative algorithm decomposes each pre-trained matrix into a high-precision low-rank component and a memory-efficient quantized component. During fine-tuning, only the low-rank component is updated; the quantized component stays fixed.
2. **Quantization Strategy Optimization**: The paper formulates the choice of quantization parameters as an integer linear program. Given an overall target memory budget set by the user, the program assigns each matrix its own quantization parameters (such as bit-width and block size).
3. **Data-Aware Algorithm**: The paper also explores a data-aware version of the algorithm, which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition.

Experiments on fine-tuning RoBERTa and LLaMA-2 (7B and 70B) show that the proposed low-rank plus quantized matrix decomposition method (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and allows quantization to below 3 bits with only minor performance degradation. When fine-tuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression: the 2.75-bit LLaMA-2-70B model (2.85 bits on average once the low-rank components are included, requiring 27GB of GPU memory) performs respectably compared to the 16-bit baseline.
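To make the decomposition step concrete, here is a minimal NumPy sketch of one plausible form of the iterative algorithm: alternate between quantizing the residual `W - L1 @ L2` and refitting the low-rank factors to `W - Q` with a truncated SVD. The summary above does not spell out the update rules, so the alternation itself, the function names `quantize_uniform` and `lq_decompose`, and all default values are assumptions for illustration. In particular, the quantizer is a simple blockwise round-to-nearest stand-in, not the quantization scheme used in the paper.

```python
import numpy as np


def quantize_uniform(x, bits=3, block_size=64):
    """Hypothetical stand-in quantizer: blockwise absmax round-to-nearest.

    Assumes x.size is divisible by block_size. Returns the dequantized
    values, i.e. the matrix the quantized weights would represent.
    """
    flat = x.reshape(-1, block_size)
    levels = 2 ** (bits - 1) - 1  # symmetric integer grid [-levels, levels]
    scale = np.abs(flat).max(axis=1, keepdims=True) / levels
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(flat / scale).clip(-levels, levels)
    return (q * scale).reshape(x.shape)


def lq_decompose(W, rank=8, bits=3, block_size=64, num_iters=10):
    """Approximate W as Q + L1 @ L2 by alternating two steps (assumed form).

    Q is a quantized matrix; L1 (m x rank) and L2 (rank x n) are the
    full-precision low-rank factors that would be trained during finetuning.
    """
    L1 = np.zeros((W.shape[0], rank))
    L2 = np.zeros((rank, W.shape[1]))
    best = (np.inf, None, None, None)
    for _ in range(num_iters):
        # Step 1: quantize whatever the low-rank part does not explain.
        Q = quantize_uniform(W - L1 @ L2, bits, block_size)
        # Step 2: best rank-r fit to the quantization residual via SVD.
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        sqrt_s = np.sqrt(S[:rank])
        L1 = U[:, :rank] * sqrt_s
        L2 = sqrt_s[:, None] * Vt[:rank]
        # The alternation is a heuristic, so keep the best iterate seen.
        err = np.linalg.norm(W - (Q + L1 @ L2))
        if err < best[0]:
            best = (err, Q, L1, L2)
    return best[1], best[2], best[3], best[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 128))
    Q, L1, L2, err = lq_decompose(W)
    baseline = np.linalg.norm(W - quantize_uniform(W))
    print(f"quantize only: {baseline:.4f}  low-rank plus quantized: {err:.4f}")
```

Because the first iteration starts from plain quantization of `W` and then adds the best rank-`r` correction to its residual, the returned error can never exceed the quantize-only error. In a fine-tuning setup of the kind described above, `Q` would be frozen and only `L1` and `L2` would receive gradient updates.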