LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim

2024-08-27

Abstract: We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables aggressive quantization to sub-3 bits with only minor performance degradations. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) performs respectably compared to the 16-bit baseline.
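The alternating decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `quantize_uniform` is a hypothetical blockwise round-to-nearest quantizer used as a stand-in for whatever quantizer the paper uses, and the names `lq_decompose`, `rank`, `bits`, `block_size`, and `num_iters` are illustrative assumptions.

```python
import numpy as np

def quantize_uniform(x, bits=3, block_size=64):
    """Stand-in quantizer (assumption): per-block min/max round-to-nearest.

    Returns the dequantized values so the result can be compared with x.
    """
    flat = x.reshape(-1)
    pad = (-flat.size) % block_size  # zero-pad so the data splits into whole blocks
    padded = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
    blocks = padded.reshape(-1, block_size)
    lo = blocks.min(axis=1, keepdims=True)
    hi = blocks.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # avoid a zero step size
    codes = np.round((blocks - lo) / scale)
    dequantized = codes * scale + lo
    return dequantized.reshape(-1)[: flat.size].reshape(x.shape)

def lq_decompose(W, rank=8, bits=3, block_size=64, num_iters=10):
    """Approximate W as Q + L1 @ L2 by alternating two steps:

    1. quantize the residual left after removing the low-rank part;
    2. refit the low-rank part to the quantization error with a truncated SVD.
    """
    L1 = np.zeros((W.shape[0], rank))
    L2 = np.zeros((rank, W.shape[1]))
    best = None
    for _ in range(num_iters):
        Q = quantize_uniform(W - L1 @ L2, bits, block_size)
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(S[:rank])
        L1 = U[:, :rank] * root
        L2 = root[:, None] * Vt[:rank]
        err = np.linalg.norm(W - (Q + L1 @ L2))
        # The alternation is a heuristic and need not decrease the error
        # monotonically, so keep the best iterate seen so far.
        if best is None or err < best[0]:
            best = (err, Q, L1, L2)
    err, Q, L1, L2 = best
    return Q, L1, L2, err

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
Q, L1, L2, err = lq_decompose(W)
print("quantization only:", np.linalg.norm(W - quantize_uniform(W)))
print("quantized + low-rank:", err)
```

In a finetuning setup along the lines the abstract describes, `Q` would stay frozen while `L1` and `L2` remain trainable in high precision.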

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the memory cost of fine-tuning large language models (LLMs). It makes three contributions:

1. **Memory-Efficient Fine-Tuning**: An iterative algorithm decomposes each pre-trained matrix into a high-precision low-rank component and a memory-efficient quantized component. During fine-tuning, only the low-rank component is updated; the quantized component stays fixed.
2. **Quantization Strategy Optimization**: An integer linear programming formulation assigns quantization parameters (such as bit-width and block size) to each matrix individually, so that the model as a whole meets a user-specified target memory budget.
3. **Data-Aware Algorithm**: A data-aware variant of the algorithm uses an approximation of the Fisher information matrix to weight the reconstruction objective of the matrix decomposition.

In experiments on fine-tuning RoBERTa and LLaMA-2 (7B and 70B), the proposed low-rank plus quantized matrix decomposition method (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and allows quantization to below 3 bits with only minor performance degradation. When fine-tuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression: the 2.75-bit LLaMA-2-70B model (2.85 bits on average when the low-rank components are included, requiring 27GB of GPU memory) performs respectably compared to the 16-bit baseline.
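To make the budgeted configuration idea concrete, here is a toy sketch of choosing one quantization configuration per matrix under a total bit budget. The paper formulates this as an integer linear program; the sketch below swaps in exhaustive search, which is only practical for tiny examples. The function name `allocate_configs` and all numbers (matrix sizes, bits per parameter, reconstruction errors) are made up for illustration.

```python
from itertools import product

def allocate_configs(num_params, candidates, budget_bits):
    """Pick one configuration per matrix, minimizing total error within a budget.

    num_params[i]  : number of parameters in matrix i
    candidates[i]  : list of (bits_per_param, reconstruction_error) for matrix i
    budget_bits    : maximum total storage in bits

    Exhaustive search is a stand-in (assumption) for an ILP solver.
    """
    best_choice, best_error = None, float("inf")
    for choice in product(*[range(len(c)) for c in candidates]):
        total_bits = sum(
            num_params[i] * candidates[i][j][0] for i, j in enumerate(choice)
        )
        if total_bits > budget_bits:
            continue  # this combination exceeds the memory budget
        total_error = sum(candidates[i][j][1] for i, j in enumerate(choice))
        if total_error < best_error:
            best_choice, best_error = choice, total_error
    return best_choice, best_error

# Hypothetical toy problem: three matrices, three configurations each.
num_params = [1000, 1000, 4000]
candidates = [
    [(2.5, 9.0), (3.5, 4.0), (4.5, 1.0)],   # matrix 0: sensitive to quantization
    [(2.5, 3.0), (3.5, 2.5), (4.5, 2.0)],   # matrix 1: fairly robust
    [(2.5, 20.0), (3.5, 8.0), (4.5, 5.0)],  # matrix 2: large, so bits are costly
]
budget_bits = 3.5 * sum(num_params)  # target: 3.5 bits per parameter on average

choice, error = allocate_configs(num_params, candidates, budget_bits)
print(choice, error)  # → (2, 0, 1) 12.0
```

In this toy instance the search spends extra bits on the sensitive matrix and saves them on the robust one, which is the kind of per-matrix trade-off a fixed uniform bit-width cannot express.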

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Low-Rank Quantization-Aware Training for LLMs

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance

Bayesian-LoRA: LoRA based Parameter Efficient Fine-Tuning using Optimal Quantization levels and Rank Values trough Differentiable Bayesian Gates

QLoRA: Efficient Finetuning of Quantized LLMs

FinLoRA: Finetuning Quantized Financial Large Language Models Using Low-Rank Adaptation

RPTQ: Reorder-based Post-training Quantization for Large Language Models

RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization

QuAILoRA: Quantization-Aware Initialization for LoRA

LoQT: Low Rank Adapters for Quantized Training

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

ReALLM: A general framework for LLM compression and fine-tuning

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

LRQuant: Learnable and Robust Post-Training Quantization for Large Language Models

AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning