Batched Low-Rank Adaptation of Foundation Models

Yeming Wen, Swarat Chaudhuri
2024-04-26
Abstract: Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its incapability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request. To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its unique low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showcasing competitive results on the MultiPL-E code generation benchmark spanning over 8 languages and a multilingual speech recognition task across 6 languages.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses how to efficiently serve many task-specific adapters in real-time settings with a diverse, global user base. Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters needed to fine-tune large foundation models, but it does not efficiently handle the case where each incoming request requires its own adapter. This becomes a performance bottleneck when a service must process many heterogeneous requests that each need personalized, task-specific adaptation. To overcome this limitation, the authors propose Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its own low-rank adaptation weights, so that heterogeneous requests can be batched together efficiently. The authors report that FLoRA retains the performance merits of LoRA, with competitive results on the MultiPL-E code generation benchmark spanning 8 languages and on a multilingual speech recognition task across 6 languages.
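
To make the idea of per-example adapter weights concrete, here is a minimal NumPy sketch of a single linear layer in which every row of a minibatch uses its own low-rank update. This is a generic illustration, not the paper's FLoRA formulation: the function name `batched_lora_forward`, the argument names, the adapter-bank layout, and the gather-then-einsum approach are all assumptions made for this example.

```python
import numpy as np

def batched_lora_forward(x, W0, A, B, adapter_ids):
    """Apply a shared base weight plus a per-example low-rank update.

    Hypothetical shapes (assumed for this sketch):
      x:           (batch, d_in)          input activations
      W0:          (d_in, d_out)          frozen base weight, shared by all rows
      A:           (n_adapters, d_in, r)  low-rank "down" factors, one per adapter
      B:           (n_adapters, r, d_out) low-rank "up" factors, one per adapter
      adapter_ids: (batch,)               which adapter each row should use
    """
    base = x @ W0                      # shared computation for the whole batch
    A_sel = A[adapter_ids]             # (batch, d_in, r): gather per-row factors
    B_sel = B[adapter_ids]             # (batch, r, d_out)
    # Project each row through its own rank-r bottleneck without ever
    # materializing a full (d_in, d_out) update matrix per example.
    low = np.einsum("bi,bir->br", x, A_sel)
    delta = np.einsum("br,bro->bo", low, B_sel)
    return base + delta

# Usage: four requests served in one batch, drawn from two different adapters.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
W0 = rng.normal(size=(5, 3))
A = rng.normal(size=(2, 5, 2))
B = rng.normal(size=(2, 2, 3))
adapter_ids = np.array([0, 1, 1, 0])
y = batched_lora_forward(x, W0, A, B, adapter_ids)
```

Each output row equals what a separate forward pass with weight `W0 + A[k] @ B[k]` would produce for that row's adapter `k`, but the base matrix multiplication is done once for the whole batch.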