Abstract: In the training of large language models, parameter-efficient techniques such as LoRA reduce memory usage and communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because constraining training to a low-rank subspace too early significantly reduces model accuracy. Existing methods such as ReLoRA and GaLore attempt to address this challenge by periodically updating the low-rank subspace. However, they still fall short of the accuracy of full-rank training because they must limit the update frequency to keep optimizer states consistent, which hinders their ability to closely approximate full-rank training behavior. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA adapters with alternative parameters. SwitchLoRA updates the low-rank subspace incrementally, targeting only a few dimensions at a time to minimize the impact on optimizer states. This allows a higher update frequency, which improves accuracy by enabling the updated parameters to more closely mimic full-rank behavior during the pre-training phase. Our results demonstrate that SwitchLoRA surpasses full-rank training on the LLaMA 1.3B model, reducing perplexity from 15.23 to 15.01 while also reducing communication overhead by 54\%. Furthermore, after fully fine-tuning both the SwitchLoRA pre-trained model and the full-rank pre-trained model on the GLUE benchmark, the SwitchLoRA pre-trained model shows an average accuracy gain of about 1\% over the full-rank pre-trained model. This demonstrates the enhanced generalization and reasoning capabilities of SwitchLoRA.
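To make the idea of switching "only a few dimensions at a time" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a layer whose effective weight is W + B A, with W frozen and B, A the LoRA factors. Two details are assumptions that the abstract does not state: that a switch folds the difference into W so the effective weight is unchanged, and that only the optimizer state tied to the switched dimension is reset. The names `effective_weight`, `switch_column`, and `opt_state_A` are hypothetical.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def effective_weight(W, B, A):
    """Return W + B @ A, the weight the layer actually applies."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]


def switch_column(W, B, A, k, candidate, opt_state_A):
    """Replace column k of B with `candidate` in place.

    Assumption: the change (b_k - c_k) a_k^T is folded into the frozen W,
    so W + B @ A is the same before and after the switch.
    Assumption: only row k of A's optimizer state is reset; all other
    rows keep their accumulated statistics.
    """
    n = len(A[0])
    for i in range(len(B)):
        delta = B[i][k] - candidate[i]
        for j in range(n):
            W[i][j] += delta * A[k][j]
        B[i][k] = candidate[i]
    opt_state_A[k] = [0.0] * n


if __name__ == "__main__":
    random.seed(0)
    m, n, r = 4, 3, 2  # output dim, input dim, LoRA rank
    W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
    B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(m)]
    A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(r)]
    opt_state_A = [[0.1] * n for _ in range(r)]  # stand-in for Adam moments

    before = effective_weight(W, B, A)
    candidate = [random.gauss(0, 1) for _ in range(m)]
    switch_column(W, B, A, k=0, candidate=candidate, opt_state_A=opt_state_A)
    after = effective_weight(W, B, A)

    drift = max(abs(x - y) for rb, ra in zip(before, after) for x, y in zip(rb, ra))
    print("max change in effective weight:", drift)
    print("optimizer state rows after switch:", opt_state_A)
```

Because each switch touches a single rank-one component, the remaining r - 1 dimensions keep their optimizer statistics, which is the property the abstract credits for allowing a higher update frequency than resetting the whole subspace at once.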

Controlled Low-Rank Adaptation with Subspace Regularization for Continued Training on Large Language Models

Orthogonal Subspace Learning for Language Model Continual Learning

CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models

LoRA Learns Less and Forgets Less

Dual Low-Rank Adaptation for Continual Learning with Pre-Trained Models

Learning Attentional Mixture of LoRAs for Language Model Continual Learning

Enhancing Parameter Efficiency and Generalization in Large-Scale Models: A Regularized and Masked Low-Rank Adaptation Approach

CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation

LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning

Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Structure-Aware Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules

ASLoRA: Adaptive Sharing Low-Rank Adaptation Across Layers

ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws

SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

Flexora: Flexible Low Rank Adaptation for Large Language Models

Sparse Low-rank Adaptation of Pre-trained Language Models

BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models