LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, Bo Li
2023-08-07
Abstract: The low-rank adaptation (LoRA) method can largely reduce the amount of trainable parameters for fine-tuning large language models (LLMs), however, it still requires expensive activation memory to update low-rank weights. Reducing the number of LoRA layers or using activation recomputation could harm the fine-tuning performance or increase the computational overhead. In this work, we present LoRA-FA, a memory-efficient fine-tuning method that reduces the activation memory without performance degradation and expensive recomputation. LoRA-FA chooses to freeze the projection-down weight of $A$ and update the projection-up weight of $B$ in each LoRA layer. It ensures the change of model weight reside in a low-rank space during LLMs fine-tuning, while eliminating the requirement to store full-rank input activations. We conduct extensive experiments across multiple model types (RoBERTa, T5, LLaMA) and model scales. Our results show that LoRA-FA can always achieve close fine-tuning accuracy across different tasks compared to full parameter fine-tuning and LoRA. Furthermore, LoRA-FA can reduce the overall memory cost by up to 1.4$\times$ compared to LoRA.
Computation and Language
What problem does this paper attempt to address?
The paper primarily addresses the issue of memory consumption during the fine-tuning process of large language models (LLMs). Specifically, while the Low-Rank Adaptation (LoRA) method reduces the number of trainable parameters, it still requires a significant amount of activation memory to update the low-rank weights. To solve this problem, the paper proposes LoRA-FA (LoRA with Frozen-A), a memory-efficient fine-tuning method.

### Main Issues Addressed

1. **Reducing Activation Memory Consumption**: Although LoRA reduces the number of trainable parameters, it still requires expensive activation memory to update the low-rank weights. LoRA-FA significantly reduces the required activation memory by freezing the projection down weights (A) and only updating the projection up weights (B) in each LoRA layer.
2. **Avoiding Performance Loss**: Selectively reducing LoRA layers or using activation recomputation may affect fine-tuning performance or increase computational overhead. LoRA-FA aims to reduce memory costs while maintaining fine-tuning accuracy.
3. **Improving Computational Efficiency**: LoRA-FA not only reduces the demand for activation memory but also ensures that no additional computational overhead is introduced during the fine-tuning phase, and no latency overhead is introduced during the inference phase.

### Experimental Validation

The paper extensively validates the effectiveness of LoRA-FA through various experiments:

1. **Different Model Types and Sizes**: Including different sizes of language models such as RoBERTa, T5, and LLaMA.
2. **Multiple Tasks**: Covering natural language understanding (e.g., GLUE benchmark), machine translation (e.g., WMT16 En-Ro), and natural language generation tasks (e.g., MMLU benchmark).
3. **Memory Efficiency**: Demonstrating that LoRA-FA can reduce overall memory consumption compared to full-parameter fine-tuning and standard LoRA across different models.
For example, for the LLaMA-7B model, LoRA-FA reduces memory usage from 56GB to 27.5GB.

In summary, LoRA-FA aims to address the activation memory bottleneck in the fine-tuning process of large language models by reducing activation memory requirements while maintaining fine-tuning performance, thereby achieving more efficient model fine-tuning.
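The freezing scheme described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes a single adapted layer written as $Y = XW + XAB$, with $X$ of shape tokens × d_in, a frozen $A$ of shape d_in × r, and a trainable $B$ of shape r × d_out; this notation, the helper names, and the toy sizes are all assumptions made for illustration. Under that form, the gradient for $B$ is $(XA)^\top \, \partial L / \partial Y$, so only the rank-$r$ activation $XA$ has to be kept for the backward pass, whereas a gradient for $A$ would be $X^\top (\partial L / \partial Y) B^\top$ and would require the full-width input $X$.

```python
import random


def matmul(p, q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*q)] for row in p]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def saved_activation_elements(tokens, d_in, rank, freeze_a):
    """Elements one adapter must keep for its backward pass (hypothetical helper).

    Updating B needs the rank-r activation XA. Updating A additionally needs
    the full-width input X. Assumption: nothing else is counted here.
    """
    low_rank = tokens * rank
    full_rank = 0 if freeze_a else tokens * d_in
    return low_rank + full_rank


def grad_b(x, a, grad_y):
    """dL/dB = (XA)^T dL/dY, computed from the rank-r activation only."""
    xa = matmul(x, a)
    return matmul(transpose(xa), grad_y)


# Toy sizes, chosen only for illustration.
random.seed(0)
tokens, d_in, d_out, rank = 4, 6, 5, 2
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(tokens)]
a = [[random.gauss(0, 1) for _ in range(rank)] for _ in range(d_in)]  # frozen
b = [[0.0] * d_out for _ in range(rank)]  # trainable, assumed zero-initialised


def loss(b_mat):
    """Toy loss: sum of the adapter output XAB (the frozen XW term does not depend on B)."""
    y = matmul(matmul(x, a), b_mat)
    return sum(sum(row) for row in y)


# For this loss, dL/dY is a matrix of ones.
grad_y = [[1.0] * d_out for _ in range(tokens)]
analytic = grad_b(x, a, grad_y)

# Finite-difference check on one entry of B.
eps = 1e-6
b_plus = [row[:] for row in b]
b_plus[0][0] += eps
numeric = (loss(b_plus) - loss(b)) / eps

print("analytic dL/dB[0][0]:", analytic[0][0])
print("numeric  dL/dB[0][0]:", numeric)
print("saved elements, LoRA   :", saved_activation_elements(tokens, d_in, rank, False))
print("saved elements, LoRA-FA:", saved_activation_elements(tokens, d_in, rank, True))
```

In this toy accounting the per-layer saving grows with the ratio of d_in to r. It covers only the adapter's own stored inputs, so it is not expected to reproduce the end-to-end memory figures quoted above, which also include model weights, optimizer states, and activations from other layers.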