Fast Distributed Inference Serving for Large Language Models

Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, Xin Jin
2024-09-25
Abstract: Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the problem of low-latency inference serving for large language models (LLMs) in interactive AI applications. Existing LLM serving systems process inference jobs in a run-to-completion fashion, which causes head-of-line blocking and long latency. To overcome these problems, the paper proposes a new distributed inference serving system, FastServe.

### Main Problems

1. **Head-of-line Blocking**: In existing systems, once a task is scheduled, it keeps running until completion. A long task therefore blocks shorter tasks that arrive after it, which increases overall latency.
2. **High Latency**: Interactive AI applications require low latency to provide a good user experience, but the complexity and scale of LLMs put great pressure on the underlying inference serving infrastructure.
3. **Resource Management**: Large-scale LLM inference requires a large amount of GPU memory, and existing scheduling strategies do not manage this memory effectively, resulting in memory overflows or wasted resources.

### Solutions

1. **Preemptive Scheduling**: FastServe takes advantage of the autoregressive nature of LLM inference and allows preemption at the granularity of each output token. After a scheduled task generates an output token, FastServe can decide whether to continue executing that task or to preempt it so that another task can run.
2. **Skip-Join Multi-Level Feedback Queue (Skip-Join MLFQ) Scheduler**: This novel scheduler uses input length information to assign an appropriate initial queue to each arriving task. The task skips the queues with higher priority than the one it joins, which reduces the number of demotions.
3. **Efficient GPU Memory Management**: FastServe designs a proactive GPU memory management mechanism that offloads intermediate state to host memory when the GPU cache is nearly full and uploads it back when needed, thus avoiding memory overflows and head-of-line blocking.
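The skip-join idea and token-level preemption described in the Solutions list can be illustrated with a small simulation. This is only a sketch under stated assumptions, not FastServe's implementation: the class `SkipJoinMLFQ`, the quantum values, the cost model (`prefill_tokens_per_unit`, one time unit per decoded token), and the rule "join the first queue whose quantum covers the first iteration" are all hypothetical choices made for illustration. A real scheduler also cannot see `output_len`; the simulation uses it only to decide when a job ends.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    input_len: int      # prompt length in tokens; known when the job arrives
    output_len: int     # tokens to generate; a real scheduler cannot see this
    generated: int = 0  # output tokens produced so far
    used: float = 0.0   # service time consumed at the current queue level


class SkipJoinMLFQ:
    """Toy skip-join MLFQ; quanta and cost model are illustrative assumptions."""

    def __init__(self, quanta, prefill_tokens_per_unit=20):
        self.quanta = quanta  # one quantum per level, highest priority first
        self.prefill_tokens_per_unit = prefill_tokens_per_unit
        self.queues = [deque() for _ in quanta]

    def step_cost(self, job):
        # The first iteration processes the whole prompt, so its cost grows
        # with input length; every later iteration decodes one token.
        if job.generated == 0:
            return job.input_len / self.prefill_tokens_per_unit
        return 1.0

    def initial_level(self, job):
        # Skip-join: enter the highest-priority queue whose quantum covers
        # the first iteration, skipping all queues above it.
        first = self.step_cost(job)
        for level, quantum in enumerate(self.quanta):
            if first <= quantum:
                return level
        return len(self.quanta) - 1

    def submit(self, job):
        self.queues[self.initial_level(job)].append(job)

    def run_one_token(self):
        """Run one iteration of the highest-priority job; return (name, done)."""
        last = len(self.queues) - 1
        for level, queue in enumerate(self.queues):
            if not queue:
                continue
            job = queue.popleft()
            job.used += self.step_cost(job)
            job.generated += 1
            if job.generated >= job.output_len:
                return job.name, True
            if job.used >= self.quanta[level] and level < last:
                job.used = 0.0
                self.queues[level + 1].append(job)  # quantum spent: demote
            else:
                queue.appendleft(job)  # keeps its place until preempted
            return job.name, False
        return None


sched = SkipJoinMLFQ(quanta=[2, 8, 32])
long_job = Job("long", input_len=400, output_len=20)
short_job = Job("short", input_len=20, output_len=3)

sched.submit(long_job)              # joins level 2 directly
trace = [sched.run_one_token()[0]]  # the long job starts running
sched.submit(short_job)             # arrives later, joins level 0
finished = []
while (result := sched.run_one_token()) is not None:
    trace.append(result[0])
    if result[1]:
        finished.append(result[0])

print(trace[:5])  # ['long', 'short', 'short', 'short', 'long']
print(finished)   # ['short', 'long']
```

In this toy run the long-prompt job joins the lowest-priority queue directly instead of being demoted level by level, and the short job that arrives afterwards preempts it at the next token boundary and finishes first. Under run-to-completion, the short job would instead wait for all 20 output tokens of the long job.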
### Experimental Results

The experimental results show that, compared with the state-of-the-art solution vLLM, FastServe improves throughput by up to 31.4x under the same average latency requirement and by up to 17.9x under the same tail latency requirement.

### Summary

FastServe addresses head-of-line blocking and high latency in LLM inference serving by introducing token-level preemptive scheduling and the Skip-Join MLFQ scheduler. Its proactive GPU memory management mechanism keeps preempted jobs from exhausting GPU memory. Together, these techniques let FastServe deliver lower latency and higher throughput for interactive AI applications.