Abstract: The low-rank adaptation (LoRA) method can greatly reduce the number of trainable parameters for fine-tuning large language models (LLMs); however, it still requires expensive activation memory to update the low-rank weights. Reducing the number of LoRA layers or using activation recomputation could harm fine-tuning performance or increase computational overhead. In this work, we present LoRA-FA, a memory-efficient fine-tuning method that reduces activation memory without performance degradation or expensive recomputation. LoRA-FA freezes the projection-down weight $A$ and updates the projection-up weight $B$ in each LoRA layer. This ensures that the change in model weights resides in a low-rank space during LLM fine-tuning, while eliminating the requirement to store full-rank input activations. We conduct extensive experiments across multiple model types (RoBERTa, T5, LLaMA) and model scales. Our results show that LoRA-FA consistently achieves fine-tuning accuracy close to that of full-parameter fine-tuning and LoRA across different tasks. Furthermore, LoRA-FA can reduce the overall memory cost by up to 1.4$\times$ compared to LoRA.

What problem does this paper attempt to address?

The paper primarily addresses the issue of memory consumption during the fine-tuning process of large language models (LLMs). Specifically, while the Low-Rank Adaptation (LoRA) method reduces the number of trainable parameters, it still requires a significant amount of activation memory to update the low-rank weights. To solve this problem, the paper proposes LoRA-FA (LoRA with Frozen-A), a memory-efficient fine-tuning method.

### Main Issues Addressed

1. **Reducing Activation Memory Consumption**: Although LoRA reduces the number of trainable parameters, it still requires expensive activation memory to update the low-rank weights. LoRA-FA significantly reduces the required activation memory by freezing the projection-down weights (A) and only updating the projection-up weights (B) in each LoRA layer.
2. **Avoiding Performance Loss**: Selectively reducing LoRA layers or using activation recomputation may affect fine-tuning performance or increase computational overhead. LoRA-FA aims to reduce memory costs while maintaining fine-tuning accuracy.
3. **Improving Computational Efficiency**: LoRA-FA not only reduces the demand for activation memory but also ensures that no additional computational overhead is introduced during the fine-tuning phase, and no latency overhead is introduced during the inference phase.

### Experimental Validation

The paper extensively validates the effectiveness of LoRA-FA through various experiments:

1. **Different Model Types and Sizes**: Including different sizes of language models such as RoBERTa, T5, and LLaMA.
2. **Multiple Tasks**: Covering natural language understanding (e.g., GLUE benchmark), machine translation (e.g., WMT16 En-Ro), and natural language generation tasks (e.g., MMLU benchmark).
3. **Memory Efficiency**: Demonstrating that LoRA-FA can reduce overall memory consumption compared to full-parameter fine-tuning and standard LoRA across different models. For example, for the LLaMA-7B model, LoRA-FA reduces memory usage from 56GB to 27.5GB.

In summary, LoRA-FA aims to address the activation memory bottleneck in the fine-tuning process of large language models by reducing activation memory requirements while maintaining fine-tuning performance, thereby achieving more efficient model fine-tuning.
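To make the memory argument concrete, the following is a minimal NumPy sketch of a single adapted linear layer, assuming the usual LoRA formulation $Y = XW + XAB$ with $W$ frozen. It is not the authors' implementation: the function names `lora_fa_step` and `saved_adapter_activations` are hypothetical, and the element count is a simplified per-layer estimate that ignores all other activations in the model. The point it illustrates is that the gradient of $B$ depends only on the low-rank product $XA$, whereas a gradient for $A$ would require the full-rank input $X$ to be kept for the backward pass.

```python
import numpy as np


def lora_fa_step(x, w, a, b, grad_y):
    """One forward/backward pass of Y = X W + (X A) B with W and A frozen.

    x: (n, d_in) input, w: (d_in, d_out), a: (d_in, r), b: (r, d_out),
    grad_y: (n, d_out) upstream gradient dL/dY.
    """
    xa = x @ a                      # low-rank activation, shape (n, r)
    y = x @ w + xa @ b              # forward output
    grad_b = xa.T @ grad_y          # dL/dB uses only xa, never x itself
    # dL/dX uses weights and grad_y only, so x need not be stored for it.
    grad_x = grad_y @ w.T + (grad_y @ b.T) @ a.T
    return y, grad_b, grad_x


def saved_adapter_activations(n_tokens, d_in, rank, freeze_a):
    """Elements kept for the adapter's backward pass (simplified estimate).

    Training A needs dL/dA = X^T (dL/dY B^T), which requires the full-rank X.
    With A frozen, only the rank-r product X A is needed.
    """
    low_rank = n_tokens * rank
    full_rank = 0 if freeze_a else n_tokens * d_in
    return low_rank + full_rank


# Small illustrative run with made-up sizes (assumptions, not paper settings).
rng = np.random.default_rng(0)
n, d_in, d_out, r = 6, 16, 12, 2
x = rng.normal(size=(n, d_in))
w = rng.normal(size=(d_in, d_out))
a = rng.normal(size=(d_in, r))
b = rng.normal(size=(r, d_out))
grad_y = rng.normal(size=(n, d_out))

y, grad_b, grad_x = lora_fa_step(x, w, a, b, grad_y)
print("LoRA elements saved:   ", saved_adapter_activations(n, d_in, r, freeze_a=False))
print("LoRA-FA elements saved:", saved_adapter_activations(n, d_in, r, freeze_a=True))
```

In this simplified count, the saving per layer scales with the ratio of the hidden size to the rank, which is why freezing $A$ matters most when the rank is small relative to the input dimension.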

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

LoRA Learns Less and Forgets Less

LoRA$^2$ : Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning

LoRA-Mini : Adaptation Matrices Decomposition and Selective Training

FanLoRA: Fantastic LoRAs and Where to Find Them in Large Language Model Fine-tuning

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning

ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models

Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models

LoRA-GA: Low-Rank Adaptation with Gradient Approximation

LoRA+: Efficient Low Rank Adaptation of Large Models

ASLoRA: Adaptive Sharing Low-Rank Adaptation Across Layers

PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices