Abstract: Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arriving job to join. Queues with higher priority than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe, and experimental results show that, compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.

What problem does this paper attempt to address?

This paper addresses the problem of low-latency inference serving for large language models (LLMs) in interactive AI applications. Specifically, existing LLM serving systems schedule inference jobs with run-to-completion processing, which leads to head-of-line blocking and long latency. To overcome these problems, the paper proposes a new distributed inference serving system, FastServe.

### Main Problems

1. **Head-of-line Blocking**: In existing systems, once a job is scheduled, it keeps running until completion. If that job is long, it blocks shorter jobs that arrive later, which increases overall latency.
2. **High Latency**: Interactive AI applications require low latency to provide a good user experience, but the complexity and scale of LLMs put great pressure on the underlying inference serving infrastructure.
3. **Resource Management**: Large-scale LLM inference jobs require a large amount of GPU memory, and existing scheduling strategies cannot manage this memory effectively, resulting in memory overflows or wasted resources.

### Solutions

1. **Preemptive Scheduling**: FastServe takes advantage of the autoregressive nature of LLM inference and allows preemption at the granularity of each output token. After a scheduled job generates an output token, FastServe can decide whether to continue executing that job or preempt it in favor of another job.
2. **Skip-Join Multi-Level Feedback Queue (Skip-Join MLFQ) Scheduler**: This novel scheduler uses input length information to assign each arriving job an appropriate initial queue. The job skips the queues with higher priority than the one it joins, which reduces the number of demotions.
3. **Efficient GPU Memory Management**: FastServe designs a proactive GPU memory management mechanism that offloads intermediate state to host memory when the cache is nearly full and uploads it again when needed, thus avoiding memory overflows and head-of-line blocking.
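The scheduling idea in the list above can be illustrated with a small simulation. This is a minimal sketch rather than FastServe's implementation: the class names, the doubling quanta, the integer millisecond costs, and the rule "join the first queue whose quantum covers the estimated first-token time" are all assumptions made for illustration. It shows the two properties the summary describes: a job's initial queue depends on its input length, and the scheduler can switch jobs after every output token.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    input_len: int      # prompt tokens, known when the job arrives
    output_len: int     # tokens to generate; the scheduler never reads this ahead of time
    generated: int = 0  # output tokens produced so far
    used_ms: int = 0    # time consumed in the current queue


class SkipJoinMLFQ:
    """Toy skip-join MLFQ. All parameters are hypothetical."""

    def __init__(self, num_queues=4, base_quantum_ms=100,
                 prefill_ms_per_token=1, decode_ms=20):
        # Assumption: each lower-priority queue has twice the quantum.
        self.quanta = [base_quantum_ms * 2 ** i for i in range(num_queues)]
        self.queues = [deque() for _ in range(num_queues)]
        self.prefill_ms_per_token = prefill_ms_per_token
        self.decode_ms = decode_ms

    def first_token_ms(self, job):
        # Assumption: time to the first token grows linearly with input length.
        return job.input_len * self.prefill_ms_per_token

    def initial_queue(self, job):
        # Skip-join: skip every queue whose quantum is too short for the
        # first token, so the job is not demoted right after its first step.
        need = self.first_token_ms(job)
        for level, quantum in enumerate(self.quanta):
            if quantum >= need:
                return level
        return len(self.quanta) - 1

    def submit(self, job):
        self.queues[self.initial_queue(job)].append(job)

    def step(self):
        """Run one iteration (one output token); return the job name or None."""
        for level, queue in enumerate(self.queues):
            if queue:
                job = queue.popleft()
                break
        else:
            return None  # nothing to run

        cost = self.first_token_ms(job) if job.generated == 0 else self.decode_ms
        job.generated += 1
        job.used_ms += cost

        if job.generated >= job.output_len:
            return job.name  # finished; the job leaves the system
        if job.used_ms >= self.quanta[level]:
            # Quantum exhausted: demote (or rotate within the last queue).
            job.used_ms = 0
            self.queues[min(level + 1, len(self.queues) - 1)].append(job)
        else:
            # Stay at the head; a higher-priority arrival can still
            # preempt this job before its next token.
            queue.appendleft(job)
        return job.name


sched = SkipJoinMLFQ()
sched.submit(Job("long", input_len=350, output_len=10))  # joins queue 2
order = [sched.step()]
sched.submit(Job("short", input_len=50, output_len=3))   # joins queue 0
order += [sched.step() for _ in range(4)]
print(order)
```

In this run the long-prompt job produces one token, the short job that arrives afterwards preempts it and runs to completion, and only then does the long job resume. A run-to-completion scheduler would instead make the short job wait for all ten tokens of the long one.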
### Experimental Results

The experimental results show that, compared with the state-of-the-art solution vLLM, FastServe improves throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.

### Summary

FastServe addresses head-of-line blocking and high latency in LLM inference serving by introducing preemptive scheduling and the Skip-Join Multi-Level Feedback Queue scheduler. Its GPU memory management mechanism keeps the system stable and performant when many preempted jobs hold intermediate state. Together, these techniques give FastServe lower latency and higher throughput for large-scale LLM inference, particularly in interactive AI applications.
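The memory management idea (offload state to host memory when the GPU cache is nearly full, upload it again before the job runs) can be sketched in the same spirit. The code below is a hypothetical bookkeeping model, not FastServe's mechanism: the `KVCacheManager` class, the token-based capacity, the offload threshold, and the "lowest priority is offloaded first" victim policy are assumptions, and no real GPU transfers take place.

```python
from dataclasses import dataclass


@dataclass
class CachedJob:
    name: str
    priority: int        # 0 = highest priority (assumed to mirror the MLFQ level)
    kv_tokens: int       # size of the job's intermediate state, in tokens
    on_gpu: bool = True  # False once the state has been offloaded to host memory


class KVCacheManager:
    """Toy model of proactive offload and upload. Names are hypothetical."""

    def __init__(self, gpu_capacity_tokens, offload_threshold_tokens):
        self.capacity = gpu_capacity_tokens
        self.threshold = offload_threshold_tokens
        self.jobs = {}

    def gpu_used(self):
        return sum(j.kv_tokens for j in self.jobs.values() if j.on_gpu)

    def admit(self, job):
        """Register a job's state, then offload if the cache is nearly full."""
        self.jobs[job.name] = job
        return self.rebalance()

    def rebalance(self):
        # Assumed victim policy: offload the lowest-priority jobs first,
        # since they are the least likely to be scheduled soon.
        victims = sorted((j for j in self.jobs.values() if j.on_gpu),
                         key=lambda j: j.priority, reverse=True)
        offloaded = []
        for job in victims:
            if self.gpu_used() <= self.threshold:
                break
            job.on_gpu = False
            offloaded.append(job.name)
        return offloaded

    def prefetch(self, name):
        """Upload a job's state ahead of scheduling, if it fits on the GPU."""
        job = self.jobs[name]
        if not job.on_gpu and self.gpu_used() + job.kv_tokens <= self.capacity:
            job.on_gpu = True
        return job.on_gpu

    def release(self, name):
        """Drop the state of a finished job."""
        del self.jobs[name]
```

A scheduler built on this sketch would call `admit` when a job starts holding state, `prefetch` shortly before the job reaches the head of its queue, and `release` when it finishes, so that a preempted job's state does not have to stay on the GPU while the job waits.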

Fast Distributed Inference Serving for Large Language Models

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

Fast Inference for Augmented Large Language Models

Efficient LLM inference solution on Intel GPU

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

UELLM: A Unified and Efficient Approach for LLM Inference Serving

Distributed Inference Performance Optimization for LLMs on CPUs

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Efficient Deployment of Large Language Model Across Cloud-Device Systems

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Stateful Large Language Model Serving with Pensieve

Efficient and Economic Large Language Model Inference with Attention Offloading

LiveMind: Low-latency Large Language Models with Simultaneous Inference

Efficient LLM Scheduling by Learning to Rank

Inference Performance Optimization for Large Language Models on CPUs

InferCept: Efficient Intercept Support for Augmented Large Language Model Inference

ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving