Abstract: Training and fine-tuning large language models (LLMs) are constrained by memory and compute requirements that grow with the size of the model weights and the optimizer states. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA), which adds a trainable low-rank matrix in parallel to the frozen pre-trained weights at each layer. However, these methods often fall short of full-rank weight training because they restrict the parameter search to a low-rank subspace. This restriction can disrupt training dynamics and may require a full-rank warm start to mitigate its impact. In this paper, we introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated layer gradients gradually decreases and asymptotically approaches one. Leveraging this, our approach adaptively reduces the rank of the gradients during Adam optimization steps, using an efficient online-updating low-rank projection rule. We further present a randomized SVD scheme for efficiently finding the projection matrix. Our technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, significantly reducing overall memory requirements during training compared to state-of-the-art methods while improving model performance in both pretraining and fine-tuning. Finally, we provide a convergence analysis of our method and demonstrate its merits for training and fine-tuning language and biological foundation models.
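To make the idea of a randomized low-rank gradient projection concrete, the following is a minimal NumPy sketch. It is not the paper's algorithm: it uses a generic randomized range finder followed by a small SVD, and the function names (`randomized_projection`, `choose_rank`), the oversampling amount, and the energy-threshold rule for picking the rank are all assumptions made for illustration.

```python
import numpy as np


def randomized_projection(grad, rank, oversample=5, seed=0):
    """Approximate an orthonormal basis for the top-`rank` left singular
    subspace of `grad` with a generic randomized range finder.

    Hypothetical helper for illustration; not the paper's exact scheme.
    Requires rank <= min(grad.shape).
    """
    rng = np.random.default_rng(seed)
    m, n = grad.shape
    k = min(rank + oversample, m, n)
    # Sketch the column space of grad with a Gaussian test matrix.
    sketch = grad @ rng.standard_normal((n, k))        # shape (m, k)
    q, _ = np.linalg.qr(sketch)                        # orthonormal, (m, k)
    # A small SVD of the compressed matrix refines the basis.
    u_small, s, _ = np.linalg.svd(q.T @ grad, full_matrices=False)
    return q @ u_small[:, :rank], s[:rank]


def choose_rank(singular_values, total_energy, threshold=0.9):
    """Smallest rank whose squared singular values capture `threshold`
    of `total_energy` (an assumed selection rule, for illustration)."""
    captured = np.cumsum(singular_values ** 2) / total_energy
    idx = int(np.searchsorted(captured, threshold))
    return min(idx + 1, len(singular_values))


# Usage: project a synthetic, nearly low-rank "gradient".
rng = np.random.default_rng(1)
grad = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 32))
proj, sing = randomized_projection(grad, rank=4)
low_rank_grad = proj.T @ grad      # (4, 32): what the optimizer would track
r = choose_rank(sing, np.sum(grad ** 2), threshold=0.99)
```

In a memory-efficient optimizer of this general kind, the Adam moments would be kept for `low_rank_grad` rather than for the full gradient, and the update would be mapped back with `proj @ update`. How and when the projection and rank are refreshed is specific to the method and is not captured by this sketch.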

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Bayesian Low-rank Adaptation for Large Language Models

Adaptive Feature-based Low-Rank Compression of Large Language Models Via Bayesian Optimization

Training-Free Bayesianization for Low-Rank Adapters of Large Language Models

Gaussian Stochastic Weight Averaging for Bayesian Low-Rank Adaptation of Large Language Models

Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation

BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models

Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

BoRA: Bayesian Hierarchical Low-Rank Adaption for Multi-task Large Language Models

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

Large Language Models to Enhance Bayesian Optimization

Hyperparameter Optimization for Large Language Model Instruction-Tuning

HyperLoRA: Efficient Cross-task Generalization Via Constrained Low-Rank Adapters Generation

Controlled Low-Rank Adaptation with Subspace Regularization for Continued Training on Large Language Models

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Flexora: Flexible Low Rank Adaptation for Large Language Models

SBoRA: Low-Rank Adaptation with Regional Weight Updates

Enhancing Parameter Efficiency and Generalization in Large-Scale Models: A Regularized and Masked Low-Rank Adaptation Approach

BiLoRA: A Bi-level Optimization Framework for Overfitting-Resilient Low-Rank Adaptation of Large Pre-trained Models