Abstract: The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA
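
The idea of a unified memory pool can be illustrated with a toy allocator. This is a minimal sketch and not S-LoRA's implementation: the class `UnifiedPagePool`, its methods, and the assumption that an adapter of rank r and a KV cache of sequence length s occupy r and s fixed-size pages respectively are hypothetical choices made for illustration. It only shows why sharing one free list between the two kinds of tensors avoids reserving separate, possibly underused, regions.

```python
class UnifiedPagePool:
    """Toy unified pool: one free list shared by adapter weights and KV cache.

    Hypothetical illustration only; names and page accounting are assumptions.
    """

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # indices of unused pages
        self.owner = {}                     # page index -> (kind, key)

    def allocate(self, kind, key, num_pages):
        """Take num_pages pages (not necessarily contiguous) for (kind, key)."""
        if num_pages > len(self.free):
            raise MemoryError("not enough free pages in the unified pool")
        pages = [self.free.pop() for _ in range(num_pages)]
        for page in pages:
            self.owner[page] = (kind, key)
        return pages

    def release(self, kind, key):
        """Return every page held by (kind, key) to the shared free list."""
        pages = [p for p, o in self.owner.items() if o == (kind, key)]
        for page in pages:
            del self.owner[page]
            self.free.append(page)
        return len(pages)


# Assumed accounting: a rank-8 adapter takes 8 pages, a 5-token KV cache 5 pages.
pool = UnifiedPagePool(num_pages=16)
adapter_pages = pool.allocate("adapter", "lora-a", 8)
kv_pages = pool.allocate("kv", "request-1", 5)

# Evicting the adapter frees pages that a later KV cache or adapter can reuse.
pool.release("adapter", "lora-a")
more_kv_pages = pool.allocate("kv", "request-2", 10)
```

Because pages need not be contiguous, tensors of different sizes can come and go without leaving unusable gaps, which is the fragmentation concern the abstract mentions.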

SLoB: Suboptimal Load Balancing Scheduling in Local Heterogeneous GPU Clusters for Large Language Model Inference

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Llumnix: Dynamic Scheduling for Large Language Model Serving

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs

Efficient LLM Scheduling by Learning to Rank

LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster

Fast Distributed Inference Serving for Large Language Models

One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Practical offloading for fine-tuning LLM on commodity GPU via learned subspace projectors

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning