Abstract: Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to match the performance of fully fine-tuned models on various tasks with an extreme reduction in the number of trainable parameters. Even in settings where both methods learn similarly accurate models, *are their learned solutions really equivalent?* We study how different fine-tuning methods change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that full fine-tuning and LoRA yield weight matrices whose singular value decompositions exhibit very different structure; moreover, the fine-tuned models themselves show distinct generalization behaviors when tested outside the adaptation task's distribution. More specifically, we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call *intruder dimensions*. Intruder dimensions do not appear during full fine-tuning. Second, we show that LoRA models with intruder dimensions, despite achieving similar performance to full fine-tuning on the target task, become worse models of the pre-training distribution and adapt less robustly to multiple tasks sequentially. Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning, even when performing on par with lower-rank LoRA models on the same tasks. These results suggest that models updated with LoRA and full fine-tuning access different parts of parameter space, even when they perform equally on the fine-tuned distribution. We conclude by examining why intruder dimensions appear in LoRA fine-tuned models, why they are undesirable, and how their effects can be minimized.

What problem does this paper attempt to address?

This paper asks whether Low-Rank Adaptation (LoRA) and full fine-tuning really learn the same solutions when they reach similar performance on a target task. The authors compare how each method alters a pre-trained large language model by analyzing the spectral properties (singular value decompositions) of its weight matrices. They find that, even at matched target-task performance, the two methods produce solutions that differ in structure and in generalization behavior.

The main findings are:

1. **Structural Differences**:
   - LoRA introduces new, high-ranking singular vectors, called "intruder dimensions," which are approximately orthogonal to the singular vectors of the pre-trained model.
   - Full fine-tuning largely preserves the spectral structure of the pre-trained model and does not introduce intruder dimensions.
2. **Behavioral Differences**:
   - In continual learning, where a model is adapted to several tasks in sequence, LoRA models are more prone to forgetting previously learned tasks, especially at low rank.
   - Despite similar target-task performance, LoRA models with intruder dimensions become worse models of the pre-training distribution, whereas fully fine-tuned models are more robust outside the fine-tuning distribution.
3. **Effectiveness of High-Rank LoRA**:
   - Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning and generalize and adapt better than lower-rank LoRA models, even when the two perform on par on the target task.
   - Extremely high-rank LoRA models (e.g., full-rank LoRA) also forget more of the pre-training distribution, indicating a trade-off between expressiveness and generalization in LoRA.

Together, these results show that LoRA and full fine-tuning reach different parts of parameter space even when they perform equally on the fine-tuned distribution. The authors also examine why intruder dimensions appear, why they are undesirable, and how their effects can be minimized. The findings matter both for understanding how fine-tuning methods work and for choosing a fine-tuning strategy.
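The idea of an intruder dimension can be made concrete with a small NumPy sketch. It flags a top singular vector of the fine-tuned weights as an intruder when its largest absolute cosine similarity with every pre-trained singular vector falls below a threshold. The function name `count_intruder_dimensions`, the use of left singular vectors, the defaults for `top_k` and `threshold`, and the toy matrices are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' exact procedure.

```python
import numpy as np


def count_intruder_dimensions(w_pre, w_tuned, top_k=10, threshold=0.5):
    """Count top-k singular vectors of w_tuned with no close match in w_pre.

    `top_k` and `threshold` are illustrative defaults, not values from the paper.
    """
    # Left singular vectors are the columns of U, sorted by descending singular value.
    u_pre, _, _ = np.linalg.svd(w_pre, full_matrices=False)
    u_tuned, _, _ = np.linalg.svd(w_tuned, full_matrices=False)

    # Columns are unit-norm, so absolute dot products are absolute cosine similarities.
    cos = np.abs(u_tuned[:, :top_k].T @ u_pre)

    # A vector is flagged when even its best pre-trained match is below the threshold.
    best_match = cos.max(axis=1)
    return int(np.sum(best_match < threshold))


# Toy "pre-trained" weights: a diagonal matrix, so its singular vectors are the
# standard basis vectors.
d = 8
w_pre = np.diag(np.arange(10.0, 10.0 - d, -1.0))  # singular values 10, 9, ..., 3

# Toy update: one large rank-1 term along the uniform direction, whose cosine
# similarity with every basis vector is 1/sqrt(8), roughly 0.35.
x = np.ones((d, 1)) / np.sqrt(d)
w_tuned = w_pre + 50.0 * (x @ x.T)

n_same = count_intruder_dimensions(w_pre, w_pre, top_k=4)
n_tuned = count_intruder_dimensions(w_pre, w_tuned, top_k=4)
print(n_same, n_tuned)
```

In this toy setup the large rank-1 update creates a new dominant singular vector that matches none of the pre-trained ones, while the remaining top vectors stay close to pre-trained directions. On real models the same check would be run per weight matrix, and the count is sensitive to the chosen threshold.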

LoRA vs Full Fine-tuning: An Illusion of Equivalence

LoRA Learns Less and Forgets Less

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

Learning on LoRAs: GL-Equivariant Processing of Low-Rank Weight Spaces for Large Finetuned Models

LoRA+: Efficient Low Rank Adaptation of Large Models

Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning

On Fairness of Low-Rank Adaptation of Large Models

LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

The Expressive Power of Low-Rank Adaptation

Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models

PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

LoRA-Mini: Adaptation Matrices Decomposition and Selective Training

ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws

Sparse Low-rank Adaptation of Pre-trained Language Models

SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

LoRTA: Low Rank Tensor Adaptation of Large Language Models

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

FairLoRA: Unpacking Bias Mitigation in Vision Models with Fairness-Driven Low-Rank Adaptation

Exploring Gradient Subspaces: Addressing and Overcoming LoRA's Limitations in Federated Fine-Tuning of Large Language Models