A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

Damjan Kalajdzievski
2023-11-28
Abstract: As large language models (LLMs) have become increasingly compute and memory intensive, parameter-efficient fine-tuning (PEFT) methods are now a common strategy to fine-tune LLMs. A popular PEFT method is Low-Rank Adapters (LoRA), which adds trainable low-rank "adapters" to selected layers. Each adapter consists of a low-rank matrix product, multiplicatively scaled by a rank-dependent factor. This scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for LoRA with higher-rank adapters. Consequently, the use of LoRA in practice has generally been limited to very low ranks. In this work, we study the impact of the scaling factor on the learning process and prove that LoRA adapters should be divided by a factor of the square root of the rank. Modifying LoRA with the appropriate scaling factor, which we call the rank-stabilized LoRA (rsLoRA) method, easily provides for a fine-tuning compute/performance trade-off, where larger ranks can be used to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses a problem that arises when LoRA (Low-Rank Adapters) is used to fine-tune large language models (LLMs): as the adapter rank increases, learning slows and performance gains stall. In LoRA, each low-rank adapter is multiplied by a rank-dependent scaling factor. Standard implementations set this factor inversely proportional to the rank, \( \gamma_r = \alpha / r \), which causes the gradient magnitude to shrink as the rank grows (a phenomenon the paper calls "gradient collapse"). Higher-rank adapters therefore cannot make full use of their additional parameters, and in practice LoRA is typically restricted to very low ranks.

To address this, the paper proposes rank-stabilized LoRA (rsLoRA), which changes the scaling factor to be inversely proportional to the square root of the rank, \( \gamma_r = \alpha / \sqrt{r} \). Through theoretical analysis and experiments, the authors show that this scaling factor keeps learning stable and maintains good performance at higher ranks. Users can therefore choose a higher rank according to their available compute, trading additional training cost for better fine-tuning performance with no change in inference cost.

In summary, the paper aims to modify LoRA so that fine-tuning remains stable and effective at higher ranks.
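As a rough illustration of the two scaling rules mentioned above, the following sketch computes \( \gamma_r \) for a few ranks. The function name `lora_scale`, the `rank_stabilized` flag, and the value \( \alpha = 16 \) are assumptions made for this example only; they are not taken from the paper or from any particular library.

```python
import math


def lora_scale(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Return the adapter scaling factor gamma_r (hypothetical helper).

    Standard LoRA:  gamma_r = alpha / r
    rsLoRA:         gamma_r = alpha / sqrt(r)
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    if rank_stabilized:
        return alpha / math.sqrt(rank)
    return alpha / rank


if __name__ == "__main__":
    alpha = 16.0  # illustrative value, not prescribed by the paper
    for r in (4, 64, 1024):
        standard = lora_scale(alpha, r)
        stabilized = lora_scale(alpha, r, rank_stabilized=True)
        # The ratio stabilized / standard equals sqrt(r), so the gap
        # between the two rules widens as the rank grows.
        print(f"r={r}: LoRA={standard}, rsLoRA={stabilized}")
```

With \( \alpha = 16 \) and \( r = 64 \), the standard rule gives a factor of 0.25 while the rank-stabilized rule gives 2.0. The sketch only shows how quickly the standard factor shrinks relative to the rsLoRA factor; it does not reproduce the paper's analysis of gradient magnitudes.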