Abstract: The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA
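The Unified Paging idea described in the abstract can be illustrated with a toy allocator. This is only a sketch, not the S-LoRA implementation: the class name `UnifiedPagePool`, the string handles, and the sizing rule (one page per adapter rank unit and one page per cached token) are assumptions made for illustration. The point it tries to show is that adapter weights and KV cache entries draw from the same pool of fixed-size pages, so space freed by one can be reused by the other without requiring contiguous regions.

```python
from __future__ import annotations


class UnifiedPagePool:
    """Toy allocator: one pool of fixed-size pages shared by adapters and KV cache.

    Hypothetical illustration only; names and sizing rules are assumptions.
    """

    def __init__(self, num_pages: int) -> None:
        self.free_pages: list[int] = list(range(num_pages))
        self.pages_of: dict[str, list[int]] = {}

    def allocate(self, handle: str, num_pages: int) -> list[int]:
        """Give `handle` `num_pages` more pages; they need not be contiguous."""
        if num_pages > len(self.free_pages):
            raise MemoryError(
                f"need {num_pages} pages, only {len(self.free_pages)} free"
            )
        taken = [self.free_pages.pop() for _ in range(num_pages)]
        self.pages_of.setdefault(handle, []).extend(taken)
        return taken

    def release(self, handle: str) -> None:
        """Return every page held by `handle` to the shared pool."""
        self.free_pages.extend(self.pages_of.pop(handle))


pool = UnifiedPagePool(num_pages=16)
pool.allocate("adapter:A", 8)  # assumed: a rank-8 adapter takes 8 pages
pool.allocate("kv:req0", 5)    # assumed: 5 cached tokens take 5 pages
pool.allocate("kv:req0", 1)    # the KV cache grows by one page per new token
pool.release("adapter:A")      # adapter evicted; its pages become reusable
pool.allocate("kv:req1", 7)    # a new request reuses the freed pages
pool.allocate("adapter:B", 3)  # a rank-3 adapter fits in the scattered remainder
```

In this run the pool ends up completely full, and the pages held by `adapter:B` are not adjacent to one another, which is the situation a paged design is meant to tolerate.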

What problem does this paper attempt to address?

The paper addresses how to serve thousands of LoRA adapters efficiently on a single machine. Specifically, it focuses on optimizing memory management and computation scheduling so that a large number of LoRA adapters can be served in parallel without significantly increasing hardware cost. The key issues identified in the paper are:

1. **Memory Management Challenges**:
   - **Memory Fragmentation**: LoRA adapters differ in size and KV caches grow and shrink dynamically, so loading and unloading adapters directly leads to severe memory fragmentation.
   - **I/O Overhead**: Frequent adapter loading and unloading introduces significant I/O latency, which degrades overall system performance.
2. **Computation Scheduling Challenges**:
   - **Heterogeneous Batching**: Requests in the same batch can use adapters of different ranks and have different sequence lengths, so an efficient batching strategy is needed to minimize computation overhead.
   - **Multi-GPU Parallelism**: In a multi-GPU environment, communication and memory overhead must be managed effectively to support parallel computation over a large number of LoRA adapters.
3. **Limitations of Existing Methods**:
   - **Weight Merging**: Merging a LoRA adapter's weights directly into the base model removes the adapter overhead at inference time, but when many adapters are served simultaneously it requires many copies of the weights and misses batching opportunities.
   - **Dynamic Loading**: Existing approaches that load and unload adapters dynamically reduce memory usage but introduce additional latency and fragmentation.

To address these issues, the paper proposes the S-LoRA system, with the following main contributions:

1. **Unified Paging**:
   - A unified memory pool manages both the dynamic adapter weights and the KV caches, reducing memory fragmentation and supporting efficient storage and access in non-contiguous memory.
2. **Heterogeneous Batching**:
   - Highly optimized custom CUDA kernels support batched computation over LoRA adapters with different ranks and requests with different sequence lengths, avoiding unnecessary padding and improving hardware utilization.
3. **S-LoRA Tensor Parallelism (S-LoRA TP)**:
   - A new tensor parallelism strategy minimizes communication and memory overhead in a multi-GPU environment, supporting efficient parallel computation over a large number of LoRA adapters.

Through these innovations, S-LoRA can efficiently serve thousands of LoRA adapters on a single GPU or across multiple GPUs. Compared with existing state-of-the-art libraries such as HuggingFace PEFT and vLLM, S-LoRA significantly improves throughput and the number of adapters that can be served.
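The contrast between weight merging and batched adapter computation can be made concrete with a small sketch. It assumes the standard LoRA formulation, in which the output for an input `x` is `x W + (x A) B`, and uses plain Python lists rather than GPU kernels. The function names and the toy matrices are hypothetical; the real system relies on custom CUDA kernels. The sketch keeps a single shared `W` and applies each request's own low-rank pair at its native rank, with no padding to a maximum rank, then checks the result against a per-adapter merged copy of the weights.

```python
def vec_mat(x, M):
    """Row vector x (length n) times matrix M (n rows, m columns)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def batched_lora_forward(xs, adapter_ids, W, adapters):
    """Compute x W + (x A) B per request; each request may use another adapter."""
    outputs = []
    for x, adapter_id in zip(xs, adapter_ids):
        base = vec_mat(x, W)               # shared base-model weights
        A, B = adapters[adapter_id]        # A: h x r, B: r x h, r varies per adapter
        delta = vec_mat(vec_mat(x, A), B)  # low-rank path at the adapter's own rank
        outputs.append([b + d for b, d in zip(base, delta)])
    return outputs


def merged_forward(x, W, A, B):
    """Reference: merge the adapter into a private copy of W, then multiply."""
    AB = [vec_mat(row, B) for row in A]
    W_merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, AB)]
    return vec_mat(x, W_merged)


# Toy data (hypothetical): hidden size 2, one rank-1 and one rank-2 adapter.
W = [[1, 0], [0, 1]]
adapters = {
    "a": ([[1], [2]], [[3, 4]]),               # rank 1
    "b": ([[1, 0], [0, 1]], [[0, 1], [1, 0]]), # rank 2
}
xs = [[1, 1], [2, 3]]
adapter_ids = ["a", "b"]

batched = batched_lora_forward(xs, adapter_ids, W, adapters)
merged = [merged_forward(x, W, *adapters[a]) for x, a in zip(xs, adapter_ids)]
print(batched)  # [[10, 13], [5, 5]]
```

Both paths give the same outputs, but the merged path needs one full-size weight copy per adapter, whereas the batched path stores only the small `A` and `B` matrices alongside one shared `W`.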

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs

dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving

Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference

Batched Low-Rank Adaptation of Foundation Models

V-LoRA: an Efficient and Flexible System Boosts Vision Applications with LoRA LMM

MultiLoRA: Democratizing LoRA for Better Multi-Task Learning

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models

MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models

Federated Fine-tuning of Large Language Models under Heterogeneous Tasks and Client Resources

LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization

Sparse High Rank Adapters

FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations

SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules

LoRA: Low-Rank Adaptation of Large Language Models