Abstract: Despite large neural networks demonstrating remarkable abilities to complete different tasks, they require excessive memory usage to store the optimization states for training. To alleviate this, the low-rank adaptation (LoRA) is proposed to reduce the optimization states by training fewer parameters. However, LoRA restricts overall weight update matrices to be low-rank, limiting the model performance. In this work, we investigate the dynamics of LoRA and identify that it can be approximated by a random projection. Based on this observation, we propose Flora, which is able to achieve high-rank updates by resampling the projection matrices while enjoying the sublinear space complexity of optimization states. We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach.
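
The random-projection claim can be made concrete with a short sketch. The notation here is assumed for illustration and does not appear in the abstract: $W \in \mathbb{R}^{n \times m}$ is a frozen weight matrix, LoRA trains $W + BA$ with $B \in \mathbb{R}^{n \times r}$ initialized to zero and $A \in \mathbb{R}^{r \times m}$ drawn with i.i.d. Gaussian entries of variance $1/r$, $\eta$ is the learning rate, and $G_t = \nabla_W L_t$ is the full-weight gradient at step $t$. Under plain SGD, and under the simplifying assumption that $A$ stays close to its initial value $A_0$:

$$
\frac{\partial L_t}{\partial B} = G_t A^\top
\quad\Longrightarrow\quad
B_T \approx -\eta \sum_{t=0}^{T-1} G_t A_0^\top
\quad\Longrightarrow\quad
B_T A_0 \approx -\eta \Big( \sum_{t=0}^{T-1} G_t A_0^\top \Big) A_0,
\qquad
\mathbb{E}\big[ A_0^\top A_0 \big] = I_m .
$$

Read this way, the adapter stores the gradients compressed by $A_0^\top$ (an $n \times r$ matrix) and decompresses them with $A_0$. The result has rank at most $r$ for a fixed $A_0$, which is the limitation that resampling the projection is meant to lift.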

What problem does this paper attempt to address?

The paper addresses the excessive memory needed to store optimization states when training large neural networks, in particular the extra memory required by gradient accumulation and momentum. It focuses on the following points:

1. **Reducing Memory Usage**: Training large models (such as GPT-3 or Stable Diffusion) requires storing a large amount of optimization state, such as the momentum terms in the Adam optimizer, which significantly increases memory requirements. The paper proposes a method to reduce the memory these optimization states demand.
2. **Improving the LoRA Method**: Low-Rank Adaptation (LoRA) reduces memory usage by training far fewer parameters. However, LoRA restricts the overall weight update to a low-rank matrix, which may limit model performance. The paper studies the training dynamics of LoRA and finds that it can be approximated by a random projection, which motivates Flora, a method that achieves high-rank updates while keeping memory complexity low.
3. **Proposing Flora**: Flora is an optimization technique that uses sublinear memory for gradient accumulation and momentum. It builds on the observation that LoRA can be viewed as a gradient compression method. Flora lifts LoRA's low-rank limitation by repeatedly resampling the random projection matrix and storing only the compressed accumulated gradients and momentum, which significantly reduces memory usage.
4. **Experimental Validation**: The paper evaluates Flora on multiple tasks and model architectures. With Adafactor as the base optimizer, Flora achieves performance similar to uncompressed full-matrix updates and significantly outperforms other compression techniques such as LoRA.

In summary, the paper proposes Flora to reduce the memory used by optimization states when training large neural networks, especially with gradient accumulation and momentum, while keeping model performance close to that of uncompressed training.
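
The compress, accumulate, and resample idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`projection`, `compress_and_accumulate`, `decompress`, `resampled_update`), the Gaussian projection scaled by `1/sqrt(rank)`, and the use of an integer seed to regenerate the projection are all hypothetical choices made for the example.

```python
import numpy as np


def projection(seed: int, rank: int, dim: int) -> np.ndarray:
    """Regenerate a (rank x dim) Gaussian projection from its seed.

    Entries have variance 1/rank, so E[A.T @ A] is the identity.
    Only the seed needs to be stored, not the matrix itself.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((rank, dim)) / np.sqrt(rank)


def compress_and_accumulate(grads, rank: int, seed: int) -> np.ndarray:
    """Accumulate gradients in compressed form: C = sum_t G_t @ A.T.

    The accumulator is (n x rank) instead of (n x m).
    """
    n, m = grads[0].shape
    A = projection(seed, rank, m)
    C = np.zeros((n, rank))
    for G in grads:
        C += G @ A.T
    return C


def decompress(C: np.ndarray, seed: int, dim: int) -> np.ndarray:
    """Approximately recover the accumulated gradient as C @ A."""
    A = projection(seed, C.shape[1], dim)
    return C @ A


def resampled_update(grad_cycles, rank: int) -> np.ndarray:
    """Sum decompressed updates, using a fresh projection per cycle.

    Each cycle contributes a matrix of rank at most `rank`; because the
    projections differ, the sum is generally of higher rank.
    """
    n, m = grad_cycles[0][0].shape
    total = np.zeros((n, m))
    for seed, grads in enumerate(grad_cycles):  # cycle index used as seed
        C = compress_and_accumulate(grads, rank, seed)
        total += decompress(C, seed, m)
    return total
```

With a single fixed projection the decompressed update has rank at most `rank`, which mirrors the low-rank restriction of LoRA. Calling `resampled_update` with several cycles shows why resampling helps: each cycle is still stored in an `n x rank` buffer, yet the summed update is no longer confined to one `rank`-dimensional subspace.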

Flora: Low-Rank Adapters Are Secretly Gradient Compressors

Low-Rank Interconnected Adaptation across Layers

Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

HyperLoRA: Efficient Cross-task Generalization Via Constrained Low-Rank Adapters Generation

FLoCoRA: Federated learning compression with low-rank adaptation

The Expressive Power of Low-Rank Adaptation

GeoLoRA: Geometric integration for parameter efficient fine-tuning

LoRA+: Efficient Low Rank Adaptation of Large Models

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation

FLoRA: Low-Rank Core Space for N-dimension

FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer

Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation

Batched Low-Rank Adaptation of Foundation Models

LoRA-Mini : Adaptation Matrices Decomposition and Selective Training

LoRA-GA: Low-Rank Adaptation with Gradient Approximation

Optimizing Low-Rank Adaptation with Decomposed Matrices and Adaptive Rank Allocation

Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning