Abstract: Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token; the second is decode, which generates the rest of the output tokens one at a time. Prefill iterations have high latency but saturate GPU compute because the input prompt is processed in parallel. In contrast, decode iterations have low latency but also low compute utilization, because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills, which splits a prefill request into near-equal-sized chunks, and creates stall-free schedules that add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations, resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. Compared to vLLM, we achieve 2.6x higher serving capacity for Mistral-7B on a single A100 GPU and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
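The two ideas named in the abstract, chunked-prefills and stall-free scheduling, can be illustrated with a small sketch. This is not the Sarathi-Serve implementation: the names `split_prefill`, `Request`, `build_batch`, and the per-iteration `token_budget` parameter are hypothetical, and the policy shown (decodes first, then prefill tokens up to a budget) is only one plausible reading of the abstract.

```python
from dataclasses import dataclass
from math import ceil


def split_prefill(prompt_len: int, max_chunk: int) -> list[int]:
    """Split a prompt into near-equal chunks, none larger than max_chunk."""
    if prompt_len <= 0:
        return []
    n_chunks = ceil(prompt_len / max_chunk)
    base, extra = divmod(prompt_len, n_chunks)
    # The first `extra` chunks carry one additional token each.
    return [base + 1 if i < extra else base for i in range(n_chunks)]


@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens processed so far

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len


def build_batch(requests: list[Request], token_budget: int) -> list[tuple[str, str, int]]:
    """Build one iteration's batch as (request id, phase, token count) entries.

    Assumed policy: ongoing decodes are admitted first so none of them is
    paused; leftover budget is filled with prefill tokens. Updates each
    request's `prefilled` counter in place.
    """
    batch = []
    budget = token_budget
    # Pass 1: one token for every request already in the decode phase.
    for r in requests:
        if r.in_decode and budget > 0:
            batch.append((r.rid, "decode", 1))
            budget -= 1
    # Pass 2: spend what is left on partial (chunked) prefills.
    for r in requests:
        if budget == 0:
            break
        if not r.in_decode:
            n = min(budget, r.prompt_len - r.prefilled)
            batch.append((r.rid, "prefill", n))
            r.prefilled += n
            budget -= n
    return batch


if __name__ == "__main__":
    print(split_prefill(10, 4))
    reqs = [Request("a", 8, prefilled=8), Request("b", 10)]
    for step in range(4):
        print(step, build_batch(reqs, token_budget=5))
```

In this sketch, request "a" receives a decode slot in every iteration while request "b" is prefilled in pieces alongside it, which is the sense in which the schedule is stall-free. Because each iteration is capped at the same token budget, iterations do comparable amounts of work.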

Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling

Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

A System for Microserving of LLMs

Responsive ML inference in multi-tenanted environments using AQUA

SLoB: Suboptimal Load Balancing Scheduling in Local Heterogeneous GPU Clusters for Large Language Model Inference

LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

A Tale of Two Scales: Reconciling Horizontal and Vertical Scaling for Inference Serving Systems

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

CascadeServe: Unlocking Model Cascades for Inference Serving

Inference Performance Optimization for Large Language Models on CPUs

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

NanoFlow: Towards Optimal Large Language Model Serving Throughput

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling