S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica
2024-06-05
Abstract: The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses how to efficiently serve thousands of LoRA adapters derived from one base model, on a single GPU or across multiple GPUs. It focuses on memory management and computation scheduling that keep serving overhead small without significantly increasing hardware cost. The key issues it identifies are:

1. **Memory Management Challenges**
   - **Memory Fragmentation**: Adapters differ in size (rank) and KV caches grow and shrink with sequence length, so naively loading and unloading adapters leads to severe memory fragmentation.
   - **I/O Overhead**: Frequent adapter loading and unloading introduces significant I/O latency and degrades overall system performance.
2. **Computation Scheduling Challenges**
   - **Heterogeneous Batching**: Adapters in a batch have different ranks and requests have different sequence lengths, so an efficient batching strategy is needed to minimize computation overhead.
   - **Multi-GPU Parallelism**: In a multi-GPU environment, the added LoRA computation must be parallelized while keeping communication and memory overhead low.
3. **Limitations of Existing Methods**
   - **Weight Merging**: Merging an adapter's weights into the base model removes the adapter overhead during inference, but when many adapters are served simultaneously it requires a separate weight copy per adapter and misses batching opportunities across adapters.
   - **Dynamic Loading**: Loading and unloading adapters on demand reduces memory usage but introduces additional latency and fragmentation.

To address these issues, the paper proposes the S-LoRA system, with the following main contributions:

1. **Unified Paging**: A unified memory pool manages both the dynamic adapter weights and the KV caches, reducing memory fragmentation and supporting efficient storage and access in non-contiguous memory.
2. **Heterogeneous Batching**: Highly optimized custom CUDA kernels support batched computation over LoRA adapters with different ranks and requests with different sequence lengths, avoiding unnecessary padding and improving hardware utilization.
3. **S-LoRA Tensor Parallelism (S-LoRA TP)**: A new tensor parallelism strategy minimizes the communication and memory overhead of the added LoRA computation in a multi-GPU environment.

Through these innovations, S-LoRA can serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared with state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA improves throughput by up to 4 times and increases the number of served adapters by several orders of magnitude.
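
The Unified Paging idea described above can be illustrated with a toy allocator. This is a minimal sketch, not the S-LoRA implementation: the class `UnifiedPagePool` and its methods are hypothetical names, and the sizing rule (one page per unit of adapter rank and one page per cached token) is an assumption made for illustration. It shows how a single pool of fixed-size pages can hold both adapter weights and KV cache entries, and how a new allocation can reuse freed pages even when they are not contiguous.

```python
class UnifiedPagePool:
    """Toy pool of fixed-size pages shared by adapter weights and KV cache.

    Hypothetical illustration only; page sizing is an assumption:
    a rank-r adapter takes r pages, a sequence of n tokens takes n pages.
    """

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))  # indices of unused pages
        self.owners = {}  # owner key -> page indices (may be non-contiguous)

    def allocate(self, owner, n_pages):
        """Assign n_pages to owner; return the indices, or None if the pool is too full."""
        if n_pages > len(self.free_pages):
            return None
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        self.owners.setdefault(owner, []).extend(pages)
        return pages

    def release(self, owner):
        """Return every page held by owner to the free list."""
        self.free_pages.extend(self.owners.pop(owner, []))

    def num_free(self):
        return len(self.free_pages)


pool = UnifiedPagePool(num_pages=16)
pool.allocate(("adapter", "A"), 8)   # adapter of rank 8
pool.allocate(("kv", "req-1"), 5)    # KV cache for 5 tokens
pool.release(("adapter", "A"))       # adapter A is evicted
# A rank-10 adapter now fits, although the free pages are split
# around the pages still held by the KV cache of req-1.
pages_b = pool.allocate(("adapter", "B"), 10)
```

Because every allocation is a list of page indices rather than one contiguous range, the pages still held by `req-1` do not prevent adapter `B` from being placed.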
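
The arithmetic behind heterogeneous batching can also be written out as a small reference computation. The sketch below is plain Python on the CPU and is not the custom CUDA kernels the paper describes. The function names `matvec` and `batched_lora` are hypothetical, and the form y = xW + (xA)B is the standard LoRA formulation rather than something taken from this summary. It shows the base-model term computed for every request in the batch while each request applies its own adapter, with no padding to a common rank.

```python
def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in x d_out), as plain lists."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def batched_lora(xs, adapter_ids, base_w, adapters):
    """Compute y = x W + (x A) B per token, each token using its own adapter.

    adapters maps an id to (A, B) with A: d_in x r and B: r x d_out.
    The rank r may differ between adapters; nothing is padded to a common rank.
    """
    # Base-model term: identical weights for every token, whatever its adapter.
    base_out = [matvec(x, base_w) for x in xs]
    outputs = []
    for x, adapter_id, y in zip(xs, adapter_ids, base_out):
        a, b = adapters[adapter_id]
        delta = matvec(matvec(x, a), b)  # low-rank path for this token's adapter
        outputs.append([yi + di for yi, di in zip(y, delta)])
    return outputs


base_w = [[1, 0], [0, 1]]
adapters = {
    "rank1": ([[1], [0]], [[0, 1]]),               # A is 2x1, B is 1x2
    "rank2": ([[1, 0], [0, 1]], [[1, 0], [0, 1]]),  # A is 2x2, B is 2x2
}
out = batched_lora([[1, 2], [3, 4]], ["rank1", "rank2"], base_w, adapters)
```

Keeping the adapter weights separate from the base weights is what lets requests for different adapters share one base-model pass, which is the batching opportunity that weight merging gives up.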