Flora: Low-Rank Adapters Are Secretly Gradient Compressors

Yongchang Hao, Yanshuai Cao, Lili Mou
2024-06-13
Abstract: Despite large neural networks demonstrating remarkable abilities to complete different tasks, they require excessive memory usage to store the optimization states for training. To alleviate this, the low-rank adaptation (LoRA) is proposed to reduce the optimization states by training fewer parameters. However, LoRA restricts overall weight update matrices to be low-rank, limiting the model performance. In this work, we investigate the dynamics of LoRA and identify that it can be approximated by a random projection. Based on this observation, we propose Flora, which is able to achieve high-rank updates by resampling the projection matrices while enjoying the sublinear space complexity of optimization states. We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the excessive memory needed to store optimization states when training large neural networks, in particular the extra memory consumed by gradient accumulation and momentum. It focuses on the following points:

1. **Reducing Memory Usage**: Training large models (such as GPT-3 and Stable Diffusion) requires storing a significant amount of optimization state, such as the momentum terms in the Adam optimizer, which significantly increases the memory required. The paper proposes a method to reduce the memory demand of these optimization states.
2. **Improving the LoRA Method**: Low-Rank Adaptation (LoRA) reduces the optimization states by training fewer parameters. However, LoRA restricts the overall weight update to a low-rank matrix, which may limit model performance. The paper studies the dynamics of LoRA and finds that it can be approximated by a random projection, which motivates Flora: a method that achieves high-rank updates while keeping the space complexity of the optimization states sublinear.
3. **Proposing Flora**: Flora is an optimization technique that uses sublinear memory for gradient accumulation and momentum. It builds on the observation that LoRA can be viewed as a gradient compression method. Flora relaxes LoRA's low-rank limitation by resampling the random projection matrices and storing only the compressed gradient accumulation and momentum, thereby significantly reducing memory usage.
4. **Experimental Validation**: The paper evaluates Flora through experiments across different tasks and model architectures. When combined with Adafactor as the base optimizer, Flora achieves performance similar to uncompressed full-matrix updates while significantly outperforming other compression techniques such as LoRA.

In summary, the paper proposes Flora to address memory usage during the training of large neural networks, especially when gradient accumulation and momentum are used. Flora reduces the memory needed for optimization states while achieving performance comparable to uncompressed updates.
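The idea of compressing accumulated gradients with a resampled random projection can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the Gaussian scaling by `1/sqrt(rank)`, and the seed-per-group schedule are all assumptions made for illustration, and momentum is left out. The sketch stores only an `(n, rank)` buffer during accumulation, regenerates the projection from its seed instead of keeping it in memory, and shows that summing updates built from different projections is no longer confined to rank `rank`.

```python
import numpy as np


def make_projection(seed: int, rank: int, dim: int) -> np.ndarray:
    """Rebuild a (rank, dim) Gaussian projection from its seed (hypothetical helper).

    Entries are drawn from N(0, 1/rank), so E[A.T @ A] is the identity.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((rank, dim)) / np.sqrt(rank)


def compress(grad: np.ndarray, seed: int, rank: int) -> np.ndarray:
    """Down-project an (n, m) gradient to (n, rank)."""
    a = make_projection(seed, rank, grad.shape[1])
    return grad @ a.T


def decompress(compressed: np.ndarray, seed: int, dim: int) -> np.ndarray:
    """Up-project an (n, rank) buffer back to (n, dim) with the same projection."""
    a = make_projection(seed, compressed.shape[1], dim)
    return compressed @ a


def accumulate_compressed(grads, seed: int, rank: int) -> np.ndarray:
    """Average micro-batch gradients while storing only an (n, rank) buffer."""
    n, m = grads[0].shape
    buffer = np.zeros((n, rank))
    for g in grads:
        buffer += compress(g, seed, rank)
    return decompress(buffer, seed, m) / len(grads)


def resampled_update(grad_groups, rank: int, first_seed: int = 0) -> np.ndarray:
    """Sum decompressed updates, drawing a fresh projection for each group."""
    total = np.zeros_like(grad_groups[0][0])
    for i, grads in enumerate(grad_groups):
        total += accumulate_compressed(grads, first_seed + i, rank)
    return total


# Toy data: 4 accumulation cycles of 2 micro-batch gradients, each 16 x 32.
data_rng = np.random.default_rng(123)
groups = [[data_rng.standard_normal((16, 32)) for _ in range(2)] for _ in range(4)]

single = accumulate_compressed(groups[0], seed=0, rank=4)  # one fixed projection
multi = resampled_update(groups, rank=4)                   # projection resampled per cycle
print(np.linalg.matrix_rank(single), np.linalg.matrix_rank(multi))
```

With a single projection the reconstructed update has rank at most `rank`, which mirrors LoRA's restriction; resampling the projection between cycles lets the summed update exceed that rank while the stored buffer stays `(n, rank)`.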