Abstract: The rapid expansion of large language models (LLMs) has underscored the need for parameter-efficient fine-tuning methods, with LoRA (Low-Rank Adaptation) emerging as a popular solution. Although LoRA reduces the number of trainable parameters, serving multiple task-specific or user-specific LoRA modules on top of a base model still creates significant storage challenges. To address this, we introduce LoRA-XS (Low-Rank Adaptation with eXtremely Small number of parameters), a theoretically derived low-rank adaptation method that considerably reduces the number of trainable parameters while showing superior or competitive performance. LoRA-XS achieves this by inserting a small, trainable r × r weight matrix between frozen low-rank matrices, which are constructed by Singular Value Decomposition (SVD) of the original weight matrix. This lightweight matrix enables fine-tuning with drastically reduced storage requirements, making it feasible to deploy millions of personalized models while minimizing memory overhead. For instance, LoRA-XS reduces the number of trainable parameters by over 100x in 7B models compared to LoRA. Our evaluations across various benchmarks (including GLUE, GSM8K, MATH, and eight commonsense reasoning datasets) demonstrate that LoRA-XS performs competitively with or better than LoRA and other recent methods such as VeRA while being significantly more parameter-efficient. We also provide an extensive ablation study on the importance of singular vectors in transformer weights, shedding light on the underlying mechanisms driving LoRA-XS's enhanced efficiency. These findings suggest that LoRA-XS is not only a storage-efficient alternative, but also a powerful tool for scaling and personalizing LLMs.
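The construction described in the abstract (a trainable r × r matrix placed between frozen factors obtained from the SVD of the pretrained weight) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the choice to fold the singular values into the left factor, the zero initialization of the trainable matrix, and the layer sizes in the usage note are all assumptions made for illustration.

```python
import numpy as np


def lora_xs_init(W, r):
    """Build frozen factors A, B from a truncated SVD of W, plus a trainable r x r matrix R.

    Assumption: the singular values are folded into A, and R starts at zero
    so that the adapted layer initially matches the base layer.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # (d_out, r), frozen: top-r left singular vectors scaled by singular values
    B = Vt[:r, :]          # (r, d_in), frozen: top-r right singular vectors
    R = np.zeros((r, r))   # the only trainable parameters in this sketch
    return A, R, B


def lora_xs_forward(x, W, A, R, B):
    """Apply the adapted layer: x @ (W + A R B)^T, with x of shape (batch, d_in)."""
    delta_W = A @ R @ B    # (d_out, d_in) low-rank update
    return x @ W.T + x @ delta_W.T


def lora_param_count(d_in, d_out, r):
    """Trainable parameters of a standard LoRA module on one weight matrix."""
    return r * (d_in + d_out)


def lora_xs_param_count(r):
    """Trainable parameters of the r x r matrix on one weight matrix."""
    return r * r
```

For a hypothetical 4096 × 4096 projection with r = 16, `lora_param_count(4096, 4096, 16)` gives 131072 while `lora_xs_param_count(16)` gives 256. These sizes are illustrative only and are not the configuration behind the reduction figure quoted in the abstract; the point is that the adapter's trainable parameter count depends on r alone, not on the layer dimensions.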

Scaling Optimal LR Across Token Horizons

To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

Does RLHF Scale? Exploring the Impacts From Data, Model, and Method

Temporal Scaling Law for Large Language Models

Language models scale reliably with over-training and on downstream tasks

Scaling Laws for Downstream Task Performance of Large Language Models

Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Scaling Exponents Across Parameterizations and Optimizers

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

Optimization Hyper-parameter Laws for Large Language Models

Tokenizer Choice For LLM Training: Negligible or Crucial?

Where Do Large Learning Rates Lead Us?

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Scaling Law for Language Models Training Considering Batch Size

Hyperbolic Fine-tuning for Large Language Models

Scaling Law with Learning Rate Annealing

Nanolm: an Affordable LLM Pre-training Benchmark Via Accurate Loss Prediction Across Scales

ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws