Abstract: Low-rank adaptation (LoRA) has become the default approach for fine-tuning large language models (LLMs) because it significantly reduces the number of trainable parameters. However, LoRA's trainable parameter count grows with the model's embedding dimension, leading to high compute costs. Additionally, its backward pass requires storing high-dimensional intermediate activations and optimizer states, which demands high peak GPU memory. In this paper, we introduce large model fine-tuning via spectrally decomposed low-dimensional adaptation (LaMDA), a novel approach to fine-tuning LLMs that leverages low-dimensional adaptation to significantly reduce both trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path and introduces a low-dimensional trainable square matrix, so that far fewer parameters and far less activation and optimizer state must be held during training. LaMDA also gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates and further improving parameter efficiency. We also present an enhancement, LaMDA++, which incorporates a "lite-weight" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ on different LLMs across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak GPU memory usage during fine-tuning. Code will be publicly available.
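
The adaptation path described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the composition order (PMA, then the square matrix, then PMB), the matrix shapes, and the names `lora_trainable`, `lamda_trainable`, and `lamda_forward` are assumptions made for illustration, and the gradual freezing of PMB is reduced to a boolean flag.

```python
import numpy as np


def lora_trainable(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of a standard LoRA path: two rank-r projections."""
    return r * d_in + d_out * r


def lamda_trainable(d_in: int, d_out: int, r: int, pmb_frozen: bool) -> int:
    """Trainable parameters of the assumed LaMDA-style path.

    PMA (r x d_in) is always frozen; the r x r square matrix is always
    trainable; PMB (d_out x r) is trainable only until it is frozen.
    """
    square = r * r
    pmb = 0 if pmb_frozen else d_out * r
    return square + pmb


def lamda_forward(x, W, pma, square, pmb):
    """Forward pass: frozen base weight plus the low-dimensional path.

    Assumed shapes: x (n, d_in), W (d_out, d_in), pma (r, d_in),
    square (r, r), pmb (d_out, r).
    """
    base = x @ W.T                 # frozen pre-trained layer
    low_dim = (x @ pma.T) @ square.T  # project to r dims, then adapt
    return base + low_dim @ pmb.T  # project back to d_out
```

Under these assumptions, with `d_in = d_out = 4096` and `r = 32`, the LoRA path trains 262,144 parameters per layer, the sketched path trains 132,096 while PMB is still trainable, and 1,024 once PMB is frozen. These counts only illustrate the scaling argument; they are not the paper's reported figures.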

Scaling Laws for Forgetting When Fine-Tuning Large Language Models

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Forgetting before Learning: Utilizing Parametric Arithmetic for Knowledge Updating in Large Language Models

Revisiting Catastrophic Forgetting in Large Language Model Tuning

Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting

An Empirical Analysis of Forgetting in Pre-trained Models with Incremental Low-Rank Updates

A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

LoRA Learns Less and Forgets Less

Chained Tuning Leads to Biased Forgetting

Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning

Exploring Forgetting in Large Language Model Pre-Training

LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

Controlled Low-Rank Adaptation with Subspace Regularization for Continued Training on Large Language Models

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

Demystifying Language Model Forgetting with Low-rank Example Associations

Scaling Laws for Downstream Task Performance of Large Language Models

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Dissecting Learning and Forgetting in Language Model Finetuning

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation