Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various domains. However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), greatly reducing memory usage but causing noticeable performance degradation. In this paper, we identify an imbalance in fine-tuning quantized pre-trained models: overly complex adapter inputs and outputs versus low effective trainability of the adaptation. We propose Quantized LLMs with Balanced-rank Adaptation (Q-BaRA), which simplifies the adapter inputs and outputs while increasing the adapter's rank to achieve a more suitable balance for fine-tuning quantized LLMs. Additionally, for scenarios where fine-tuned LLMs need to be deployed as low-precision inference models, we introduce Quantization-Aware Fine-tuning with Higher Rank Adaptation (QA-HiRA), which simplifies the adapter inputs and outputs to align with the pre-trained model's block-wise quantization while employing a single matrix to achieve a higher rank. Both Q-BaRA and QA-HiRA are easy to implement and offer the following optimizations: (i) Q-BaRA consistently achieves the highest accuracy compared to baselines and other variants while requiring the same number of trainable parameters and the same computational effort; (ii) QA-HiRA naturally merges adapter parameters into the block-wise quantized model after fine-tuning, achieving the highest accuracy compared to other methods. We apply Q-BaRA and QA-HiRA to the LLaMA and LLaMA2 model families and validate their effectiveness across different fine-tuning datasets and downstream scenarios.
Code will be made available at https://github.com/xiaocaigou/qbaraqahira
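The trade-off described in the abstract (simpler adapter inputs and outputs in exchange for a higher rank, at the same trainable-parameter budget) can be illustrated with a small sketch. The abstract does not say how the inputs and outputs are simplified, so the scheme here is an assumption made purely for illustration: the input is average-pooled in groups of `factor`, the output is repeated `factor` times, and the rank is multiplied by `factor`. All function names are hypothetical and are not taken from the authors' code.

```python
import random

def avg_pool(x, factor):
    """Average consecutive groups of `factor` entries (assumed input simplification)."""
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]

def repeat_each(y, factor):
    """Repeat every entry `factor` times (assumed output simplification)."""
    return [v for v in y for _ in range(factor)]

def matvec(m, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]

def adapter_forward(x, a, b, factor):
    """Pooled input -> down-projection a -> up-projection b -> repeated output."""
    return repeat_each(matvec(b, matvec(a, avg_pool(x, factor))), factor)

def param_count(d_in, d_out, rank, factor=1):
    """Trainable parameters when the rank is scaled up by `factor`
    and the input/output widths are scaled down by `factor`."""
    if d_in % factor or d_out % factor:
        raise ValueError("factor must divide both d_in and d_out")
    return (rank * factor) * (d_in // factor + d_out // factor)

if __name__ == "__main__":
    d_in, d_out, rank, factor = 16, 8, 2, 2
    # a: (rank * factor) x (d_in / factor); b: (d_out / factor) x (rank * factor)
    a = [[random.gauss(0, 0.02) for _ in range(d_in // factor)]
         for _ in range(rank * factor)]
    b = [[0.0] * (rank * factor) for _ in range(d_out // factor)]  # zero init, as in LoRA
    x = [random.random() for _ in range(d_in)]
    print(len(adapter_forward(x, a, b, factor)))   # output keeps width d_out
    print(param_count(d_in, d_out, rank))          # plain low-rank adapter
    print(param_count(d_in, d_out, rank, factor))  # pooled adapter, higher rank
```

Under this assumed scheme the two parameter counts are identical, because `(rank * factor) * (d_in + d_out) / factor` reduces to `rank * (d_in + d_out)`; only the split between adapter width and rank changes.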

Gaussian Stochastic Weight Averaging for Bayesian Low-Rank Adaptation of Large Language Models

Bayesian Low-rank Adaptation for Large Language Models

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Training-Free Bayesianization for Low-Rank Adapters of Large Language Models

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

Adaptive Feature-based Low-Rank Compression of Large Language Models Via Bayesian Optimization

GaLore$+$: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection

SBoRA: Low-Rank Adaptation with Regional Weight Updates

LoRA ensembles for large language model fine-tuning

Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation

Uncertainty-Aware Natural Language Inference with Stochastic Weight Averaging

OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models

Towards Robust and Efficient Federated Low-Rank Adaptation with Heterogeneous Clients

Variational Low-Rank Adaptation Using IVON

BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models

Exploring Gradient Subspaces: Addressing and Overcoming LoRA's Limitations in Federated Fine-Tuning of Large Language Models

Personalized Collaborative Fine-Tuning for On-Device Large Language Models

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices