Batched Low-Rank Adaptation of Foundation Models

Yeming Wen, Swarat Chaudhuri

2024-04-26

Abstract: Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its incapability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request. To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its unique low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showcasing competitive results on the MultiPL-E code generation benchmark spanning over 8 languages and a multilingual speech recognition task across 6 languages.

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses how to serve multiple task-specific adapters efficiently in real-time settings with a diverse, global user base. Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters needed to fine-tune large foundation models, but its design does not efficiently handle the case where each incoming request may need a different adapter. This becomes a performance bottleneck when personalized, task-specific adaptation is required across a large number of heterogeneous requests. To overcome this limitation, the authors propose the Fast LoRA (FLoRA) framework, in which each input example in a minibatch can be associated with its own low-rank adaptation weights, so that heterogeneous requests can be batched together efficiently. According to the authors, FLoRA retains the performance advantages of LoRA and is competitive on multilingual code generation (the MultiPL-E benchmark, 8 languages) and multilingual speech recognition (6 languages), which is most relevant when requests come from users with different language and occupational backgrounds.
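To make the idea of per-example adapter weights concrete, here is a minimal NumPy sketch. It is not the paper's formulation, which this summary does not spell out. It assumes the standard additive LoRA update (base weight plus a low-rank product) and simply gathers a different low-rank pair for each example before applying it in one batched operation. The function name `batched_lora_forward`, the `adapter_ids` argument, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np

def batched_lora_forward(x, w0, a, b, adapter_ids):
    """Apply a shared base weight plus a per-example low-rank update.

    x:           (batch, d_in)           input examples
    w0:          (d_in, d_out)           shared, frozen base weight
    a:           (n_adapters, d_in, r)   low-rank down-projections
    b:           (n_adapters, r, d_out)  low-rank up-projections
    adapter_ids: (batch,)                adapter index for each example
    """
    # Shared computation: one matmul for the whole batch.
    base = x @ w0

    # Gather the adapter pair that each example asks for.
    a_sel = a[adapter_ids]  # (batch, d_in, r)
    b_sel = b[adapter_ids]  # (batch, r, d_out)

    # Per-example low-rank path, done as batched contractions
    # instead of a Python loop over requests.
    low = np.einsum("bi,bir->br", x, a_sel)      # (batch, r)
    delta = np.einsum("br,bro->bo", low, b_sel)  # (batch, d_out)

    return base + delta
```

Under these assumptions, each row of the output equals `x[i] @ (w0 + a[k] @ b[k])` with `k = adapter_ids[i]`, so examples that need different adapters can share a single forward pass.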


ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models

HyperLoRA: Efficient Cross-task Generalization Via Constrained Low-Rank Adapters Generation

LoRA-Mini: Adaptation Matrices Decomposition and Selective Training

LoRA+: Efficient Low Rank Adaptation of Large Models

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Sparse Low-rank Adaptation of Pre-trained Language Models

Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation

Structure-Aware Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

LoRA Learns Less and Forgets Less

ResLoRA: Identity Residual Mapping in Low-Rank Adaption

Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models

BiLoRA: A Bi-level Optimization Framework for Overfitting-Resilient Low-Rank Adaptation of Large Pre-trained Models

SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules

GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning

Towards Robust and Efficient Federated Low-Rank Adaptation with Heterogeneous Clients

LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters