Abstract: Serving systems for Large Language Models (LLMs) improve throughput by processing several requests concurrently. However, multiplexing hardware resources between concurrent requests involves non-trivial scheduling decisions. Practical serving systems typically implement these decisions at two levels: First, a load balancer routes requests to different servers which each hold a replica of the LLM. Then, on each server, an engine-level scheduler decides when to run a request, or when to queue or preempt it. Improved scheduling policies may benefit a wide range of LLM deployments and can often be implemented as "drop-in replacements" to a system's current policy. In this work, we survey scheduling techniques from the literature and from practical serving systems. We find that schedulers from the literature often achieve good performance but introduce significant complexity. In contrast, schedulers in practical deployments often leave easy performance gains on the table but are easy to implement, deploy and configure. This finding motivates us to introduce two new scheduling techniques, which are both easy to implement, and outperform current techniques on production workload traces.

What problem does this paper attempt to address?

This paper addresses request scheduling in large language model (LLM) serving systems. Specifically, the authors focus on how to make better use of GPU resources when multiple requests are processed concurrently, in order to increase throughput and reduce latency and the preemption rate.

### Problem Background

1. **Concurrent Processing and Scheduling Decisions**:
   - LLM serving systems improve throughput by processing multiple requests concurrently.
   - Multiplexing hardware resources (such as GPU memory) between concurrent requests involves non-trivial scheduling decisions.
2. **Memory Consumption of the KV Cache**:
   - The KV cache of each request occupies a large amount of GPU memory, and it grows as the request's sequence grows.
   - When the combined KV caches exceed the GPU memory capacity, the system must preempt some requests, which incurs additional overhead (such as recomputing the KV cache).
3. **Limitations of Existing Systems**:
   - Schedulers from the literature can improve performance, but they often introduce significant complexity.
   - Schedulers in practical deployments are kept simple to implement, deploy and configure, and therefore often leave easily achievable performance gains on the table.

### Main Contributions of the Paper

1. **Literature Review and Comparison**:
   - The authors survey scheduling techniques from the literature and from practical serving systems.
   - They find that schedulers from the literature usually perform well but introduce significant complexity, while schedulers in practical deployments are easy to implement but leave easy performance gains unused.
2. **Proposing New Scheduling Techniques**:
   - Two new scheduling techniques are proposed: LARRY (an engine-level scheduler) and SAL (a load balancer). Both are easy to implement and outperform existing techniques on production workload traces.

### Details of New Scheduling Techniques

1. **LARRY (Load-Adaptive Request Reordering for Low Latency)**:
   - Reorders the requests in the waiting queue according to their expected memory requirements and the current system load.
   - Under high memory pressure, requests with small memory consumption are processed preferentially; under low memory pressure, requests with large memory consumption are allowed to run.
   - Uses a simple formula (see Formula 1) to score each request and sorts the queue by that score:

     \[ \text{score}(r)=\text{queue\_len}\times\text{memory}(r)+\alpha\times\text{waiting\_time} \]

     where:
     - \(\text{queue\_len}\) is the length of the waiting queue.
     - \(\text{memory}(r)\) is the expected memory consumption of request \(r\).
     - \(\alpha\) is a weighting parameter used to balance the influence of waiting time and memory consumption.
2. **SAL (Server-Aware Load Balancer)**:
   - Considers the current load of each server (including queued prefill tokens and available memory) and routes each request to the server with the lowest load.
   - Uses a formula (see Formula 2) to quantify the load of each server:

     \[ \text{load}(s, r)=\max\left(\beta\times(\text{memory}(r)-\text{free\_mem}(s)),\frac{\text{queued\_tokens}(s, r)}{\text{max\_tokens\_per\_batch}}\right) \]

     where:
     - \(\beta = \frac{\mu_{\text{in}}+\mu_{\text{out}}}{\mu_{\text{out}}}\) is an approximation of the memory release rate.
     - \(\mu_{\text{in}}\) and \(\mu_{\text{out}}\)
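
To make the two formulas more concrete, the following is a minimal Python sketch that evaluates Formula 1 and Formula 2 exactly as they are quoted above. It is not the authors' implementation. The `Server` class, the function names and all numeric values are hypothetical. The summary does not state the units of memory, the precise meaning of \(\mu_{\text{in}}\) and \(\mu_{\text{out}}\), how \(\text{queued\_tokens}(s, r)\) depends on \(r\), or the direction in which LARRY sorts by score, so the sketch takes these quantities as plain inputs and only computes the scores without ordering the queue.

```python
from dataclasses import dataclass


def larry_score(queue_len: int, memory: float, waiting_time: float, alpha: float) -> float:
    """Formula 1 as quoted: queue_len * memory(r) + alpha * waiting_time.

    A longer waiting queue increases the weight of the memory term
    relative to the waiting-time term.
    """
    return queue_len * memory + alpha * waiting_time


def beta_from_rates(mu_in: float, mu_out: float) -> float:
    """beta = (mu_in + mu_out) / mu_out, as quoted in the summary."""
    return (mu_in + mu_out) / mu_out


@dataclass
class Server:
    """Hypothetical snapshot of one replica's state (names are assumptions)."""
    name: str
    free_mem: float        # free KV-cache memory, in the same unit as `memory`
    queued_tokens: float   # assumed precomputed value of queued_tokens(s, r)


def sal_load(server: Server, memory: float, beta: float, max_tokens_per_batch: int) -> float:
    """Formula 2 as quoted: the larger of the memory term and the queue term."""
    memory_term = beta * (memory - server.free_mem)
    queue_term = server.queued_tokens / max_tokens_per_batch
    return max(memory_term, queue_term)


def pick_server(servers: list[Server], memory: float, beta: float,
                max_tokens_per_batch: int) -> Server:
    """Route the request to the server with the lowest load."""
    return min(servers, key=lambda s: sal_load(s, memory, beta, max_tokens_per_batch))


if __name__ == "__main__":
    # Illustrative values only; none of them come from the paper.
    servers = [
        Server("A", free_mem=100.0, queued_tokens=4000.0),
        Server("B", free_mem=20.0, queued_tokens=500.0),
    ]
    beta = beta_from_rates(mu_in=1.0, mu_out=1.0)
    chosen = pick_server(servers, memory=50.0, beta=beta, max_tokens_per_batch=2048)
    print("routed to", chosen.name)
    print("score with long queue:", larry_score(10, 3.0, 2.0, alpha=0.5))
    print("score with empty queue:", larry_score(0, 3.0, 2.0, alpha=0.5))
```

In this illustration, server B has the shorter prefill queue but too little free memory for the request, so its memory term dominates and the request is routed to server A. In the LARRY part, the same request receives a much larger memory contribution when the waiting queue is long than when it is empty, which is how the formula adapts to load.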

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Efficient LLM Scheduling by Learning to Rank

The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving

Llumnix: Dynamic Scheduling for Large Language Model Serving

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Fairness in Serving Large Language Models

Preble: Efficient Distributed Prompt Scheduling for LLM Serving

One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

Don't Stop Me Now: Embedding Based Scheduling for LLMs

Fast Inference for Augmented Large Language Models

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Large Language Models for Power Scheduling: A User-Centric Approach

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

LLMs can Schedule

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems

Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution