Abstract: Fine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs. LoRA, one of the most popular Parameter-Efficient Fine-Tuning (PEFT) methods, offers a cost-effective alternative by fine-tuning an auxiliary low-rank model that has significantly fewer parameters. Although LoRA significantly reduces the computational and memory requirements of each iteration, extensive empirical evidence indicates that it converges considerably more slowly than full fine-tuning, ultimately leading to increased overall compute and often worse test performance. In this paper, we perform an in-depth investigation of the initialization method of LoRA and show that careful initialization (without any change to the architecture or the training algorithm) can significantly enhance both efficiency and performance. In particular, we introduce a novel initialization method, LoRA-GA (Low-Rank Adaptation with Gradient Approximation), which aligns the gradients of the low-rank matrix product with those of full fine-tuning at the first step. Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (and is hence significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance. For example, on a subset of the GLUE dataset with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, LoRA-GA shows performance improvements of 0.34, 11.52%, and 5.05% on MT-bench, GSM8K, and HumanEval, respectively. Additionally, we observe a 2-4x improvement in convergence speed compared to vanilla LoRA, validating the effectiveness of LoRA-GA in accelerating convergence and enhancing model performance. Code is available at https://github.com/Outsider565/LoRA-GA.
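To make the phrase "aligns the gradients of the low-rank matrix product with those of full fine-tuning at the first step" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out how the initialization is constructed, so the SVD-based choice below, the helper names (`first_step_product_update`, `misalignment`, `compare_inits`), the random stand-in gradient, and the matrix sizes are all hypothetical. The sketch compares how closely the first SGD step on the adapter product `B @ A` can track the full-fine-tuning direction `-G` under a vanilla-style initialization (`B = 0`, `A` random) and under an initialization built from singular vectors of `G`.

```python
import numpy as np


def first_step_product_update(G, A, B):
    """First-order change of B @ A after one SGD step with unit learning rate.

    With W = W0 + B @ A and G = dL/dW, the chain rule gives
    dL/dB = G @ A.T and dL/dA = B.T @ G, so
    delta(B @ A) ~= -(G @ A.T @ A + B @ B.T @ G).
    """
    return -(G @ A.T @ A + B @ B.T @ G)


def misalignment(G, A, B):
    """Relative distance between the best-scaled adapter update and -G."""
    D = first_step_product_update(G, A, B)
    target = -G
    # Least-squares optimal scalar step size for matching the target.
    scale = np.sum(D * target) / np.sum(D * D)
    return np.linalg.norm(scale * D - target) / np.linalg.norm(target)


def compare_inits(seed=0, m=32, n=24, r=4):
    """Return (vanilla-style misalignment, SVD-based misalignment)."""
    rng = np.random.default_rng(seed)
    # Stand-in for a real first-batch gradient of the full weight matrix.
    G = rng.standard_normal((m, n))

    # Vanilla-style init: B = 0, A random (Gaussian scaling is an assumption).
    A_rand = rng.standard_normal((r, n)) / np.sqrt(n)
    B_zero = np.zeros((m, r))

    # Gradient-informed init (assumption): disjoint singular directions of G,
    # so the two terms of the update cover 2r distinct singular components.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    B_svd = U[:, :r]
    A_svd = Vt[r:2 * r, :]

    return misalignment(G, A_rand, B_zero), misalignment(G, A_svd, B_svd)


if __name__ == "__main__":
    err_rand, err_svd = compare_inits()
    print(f"vanilla-style init misalignment: {err_rand:.3f}")
    print(f"SVD-based init misalignment:     {err_svd:.3f}")
```

With `B = 0`, the first-step update has rank at most `r`, whereas the singular-vector choice yields the best rank-`2r` approximation of `G`, so its misalignment can never be larger. The sketch ignores that a nonzero initial `B @ A` shifts the starting weights, which a practical method would have to offset, and it says nothing about scaling or training stability.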

SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

LoRA Learns Less and Forgets Less

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

Sparse Low-rank Adaptation of Pre-trained Language Models

ASLoRA: Adaptive Sharing Low-Rank Adaptation Across Layers

The Expressive Power of Low-Rank Adaptation

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation

LoRA+: Efficient Low Rank Adaptation of Large Models

HyperLoRA: Efficient Cross-task Generalization Via Constrained Low-Rank Adapters Generation

LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

Structure-Aware Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

GeoLoRA: Geometric integration for parameter efficient fine-tuning

SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules

LoRA-GA: Low-Rank Adaptation with Gradient Approximation

LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models

LoRA-Mini: Adaptation Matrices Decomposition and Selective Training

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning