LoRA vs Full Fine-tuning: An Illusion of Equivalence

Reece Shuttleworth, Jacob Andreas, Antonio Torralba, Pratyusha Sharma
2024-10-29
Abstract: Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to match the performance of fully fine-tuned models on various tasks with an extreme reduction in the number of trainable parameters. Even in settings where both methods learn similarly accurate models, are their learned solutions really equivalent? We study how different fine-tuning methods change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that full fine-tuning and LoRA yield weight matrices whose singular value decompositions exhibit very different structure; moreover, the fine-tuned models themselves show distinct generalization behaviors when tested outside the adaptation task's distribution. More specifically, we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions. Intruder dimensions do not appear during full fine-tuning. Second, we show that LoRA models with intruder dimensions, despite achieving similar performance to full fine-tuning on the target task, become worse models of the pre-training distribution and adapt less robustly to multiple tasks sequentially. Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning, even when performing on par with lower-rank LoRA models on the same tasks. These results suggest that models updated with LoRA and full fine-tuning access different parts of parameter space, even when they perform equally on the fine-tuned distribution. We conclude by examining why intruder dimensions appear in LoRA fine-tuned models, why they are undesirable, and how their effects can be minimized.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to address is: when fine-tuning pre-trained large language models, do Low-Rank Adaptation (LoRA) and full fine-tuning really learn the same solutions, despite their similar performance on target tasks? Specifically, the authors investigate how different fine-tuning methods alter pre-trained models by analyzing the spectral properties of the models' weight matrices. They find that although LoRA and full fine-tuning can achieve similar performance on target tasks, the solutions they learn differ significantly in structure and generalization behavior. The main findings include:

1. **Structural Differences**:
   - LoRA introduces new high-ranking singular vectors, referred to as "intruder dimensions," which are approximately orthogonal to the singular vectors of the pre-trained model.
   - Full fine-tuning, on the other hand, largely preserves the spectral structure of the pre-trained model and does not introduce intruder dimensions.
2. **Behavioral Differences**:
   - In continual learning, where a model is adapted to multiple tasks sequentially, models fine-tuned with LoRA are more prone to forgetting previously learned tasks, especially at low rank.
   - Despite similar performance on target tasks, models fine-tuned with LoRA become worse models of the pre-training distribution and generalize less well outside the adaptation task's distribution, whereas fully fine-tuned models are more robust.
3. **Effectiveness of High-Rank LoRA**:
   - Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning in structure and generalization, even when they perform on par with lower-rank LoRA models on the target task.
   - Extremely high-rank LoRA models (e.g., full-rank LoRA) also forget more of the pre-training distribution, indicating a trade-off between expressiveness and generalization in LoRA.

Through these studies, the authors show that LoRA and full fine-tuning differ intrinsically across tasks and settings, and they examine why intruder dimensions appear, why they are undesirable, and how their effects can be minimized.
These findings are significant for understanding the mechanisms of fine-tuning methods and selecting appropriate fine-tuning strategies.
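
To make the notion of an intruder dimension concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It flags a singular vector of the fine-tuned matrix when its largest absolute cosine similarity with every pre-trained singular vector falls below a threshold. The function name `find_intruder_dimensions`, the threshold `eps`, the cutoff `k`, the use of left singular vectors only, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np


def find_intruder_dimensions(w_pre, w_tuned, k=10, eps=0.5):
    """Return indices (within the top k) of fine-tuned singular vectors flagged as intruders.

    A fine-tuned left singular vector is flagged when its largest absolute
    cosine similarity with all pre-trained left singular vectors is below eps.
    Both k and eps are illustrative assumptions, not values from the paper.
    """
    u_pre, _, _ = np.linalg.svd(w_pre, full_matrices=False)
    u_tuned, _, _ = np.linalg.svd(w_tuned, full_matrices=False)
    # Singular vectors are unit-norm, so dot products are cosine similarities.
    # abs() ignores the arbitrary sign of each singular vector.
    sims = np.abs(u_pre.T @ u_tuned[:, :k])
    max_sim = sims.max(axis=0)
    return np.flatnonzero(max_sim < eps).tolist()


# Toy "pre-trained" matrix whose left singular vectors are e0, e1, e2.
w_pre = np.zeros((6, 3))
w_pre[0, 0], w_pre[1, 1], w_pre[2, 2] = 3.0, 2.0, 1.0

# Rank-1 update (a stand-in for a LoRA-style product B @ A) pointing along e5,
# a direction that the pre-trained singular vectors do not cover.
b = np.zeros((6, 1))
b[5, 0] = 10.0
a = np.array([[1.0, 0.0, 0.0]])
w_lora_like = w_pre + b @ a

# Update that only rescales the existing spectrum and keeps the singular vectors.
w_rescaled = 1.05 * w_pre

print(find_intruder_dimensions(w_pre, w_lora_like))  # [0]
print(find_intruder_dimensions(w_pre, w_rescaled))   # []
```

In the toy case the large rank-1 update becomes the top singular direction of the updated matrix and has cosine similarity of roughly 0.29 with its closest pre-trained singular vector, so it is flagged. The rescaled matrix keeps the original singular vectors and nothing is flagged. Real weight matrices are much larger, and the threshold would need to be chosen with that in mind.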