Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs

Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden
2024-10-23
Abstract: Serving systems for Large Language Models (LLMs) improve throughput by processing several requests concurrently. However, multiplexing hardware resources between concurrent requests involves non-trivial scheduling decisions. Practical serving systems typically implement these decisions at two levels: First, a load balancer routes requests to different servers which each hold a replica of the LLM. Then, on each server, an engine-level scheduler decides when to run a request, or when to queue or preempt it. Improved scheduling policies may benefit a wide range of LLM deployments and can often be implemented as "drop-in replacements" to a system's current policy. In this work, we survey scheduling techniques from the literature and from practical serving systems. We find that schedulers from the literature often achieve good performance but introduce significant complexity. In contrast, schedulers in practical deployments often leave easy performance gains on the table but are easy to implement, deploy and configure. This finding motivates us to introduce two new scheduling techniques, which are both easy to implement, and outperform current techniques on production workload traces.
Machine Learning
What problem does this paper attempt to address?
This paper addresses request scheduling in large language model (LLM) serving systems. Specifically, the authors focus on how to use GPU resources efficiently when many requests are processed concurrently, in order to increase throughput and reduce latency and the preemption rate.

### Problem Background

1. **Concurrent Processing and Scheduling Decisions**:
   - LLM serving systems improve throughput by processing multiple requests concurrently.
   - Multiplexing hardware resources (such as GPU memory) between concurrent requests involves non-trivial scheduling decisions.
2. **Memory Consumption of the KV Cache**:
   - The KV cache of each request occupies a large amount of GPU memory and grows as the request's sequence grows.
   - When the KV caches exceed the GPU memory capacity, the system must preempt some requests, which incurs additional overhead (such as recomputing the KV cache).
3. **Limitations of Existing Systems**:
   - Existing schedulers can improve performance, but they often either introduce complexity or fail to fully utilize the hardware.
   - In practical deployments, many systems leave easily achievable performance gains on the table in order to keep the implementation simple.

### Main Contributions of the Paper

1. **Literature Review and Comparison**:
   - The authors survey existing scheduling techniques and compare how these techniques perform in practical systems.
   - They find that schedulers from the literature usually perform well but introduce significant complexity, while schedulers in practical deployments are simpler but leave easy performance gains on the table.
2. **Proposing New Scheduling Techniques**:
   - Two new scheduling techniques are proposed: LARRY (an engine-level scheduler) and SAL (a load balancer). Both are easy to implement and outperform existing techniques on production workloads.

### Details of New Scheduling Techniques

1. **LARRY (Load-Adaptive Request Reordering for Low Latency)**:
   - Reorders the requests in the waiting queue according to their expected memory requirements and the current system load.
   - Under high memory pressure, requests with small memory consumption are processed preferentially; under low memory pressure, requests with large memory consumption are allowed to run.
   - Scores each request with a simple formula (Formula 1) and sorts the queue by score:

     \[ \text{score}(r)=\text{queue\_len}\times\text{memory}(r)+\alpha\times\text{waiting\_time} \]

     where:
     - \(\text{queue\_len}\) is the length of the waiting queue.
     - \(\text{memory}(r)\) is the expected memory consumption of request \(r\).
     - \(\alpha\) is a weighting parameter that balances the influence of waiting time against memory consumption.
2. **SAL (Server-Aware Load Balancer)**:
   - Considers the current load of each server (including queued prefill tokens and available memory) and routes each request to the server with the lowest load.
   - Quantifies the load of each server with a formula (Formula 2):

     \[ \text{load}(s, r)=\max\left(\beta\times(\text{memory}(r)-\text{free\_mem}(s)),\frac{\text{queued\_tokens}(s, r)}{\text{max\_tokens\_per\_batch}}\right) \]

     where:
     - \(\beta = \frac{\mu_{\text{in}}+\mu_{\text{out}}}{\mu_{\text{out}}}\) is an approximation of the memory release rate.
     - \(\mu_{\text{in}}\) and \(\mu_{\text{out}}\)
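
To make Formula 1 and Formula 2 concrete, here is a minimal Python sketch that evaluates them exactly as written in this summary. It is an illustration under stated assumptions, not the authors' implementation: the names `Request`, `Server`, `larry_score`, `sal_load` and `route` are hypothetical, \(\beta\) and \(\alpha\) are passed in as plain numbers, \(\text{queued\_tokens}(s, r)\) is assumed to be precomputed per server, and the units are arbitrary. The summary does not say whether LARRY runs higher or lower scores first, so the sketch only computes the score and does not sort.

```python
from dataclasses import dataclass


@dataclass
class Request:
    memory: float        # expected memory consumption memory(r); units are an assumption
    waiting_time: float  # time the request has spent in the waiting queue


@dataclass
class Server:
    name: str
    free_mem: float       # free_mem(s), same units as Request.memory
    queued_tokens: float  # queued_tokens(s, r), assumed precomputed for the request being routed


def larry_score(req: Request, queue_len: int, alpha: float) -> float:
    """Formula 1 as stated: queue_len * memory(r) + alpha * waiting_time."""
    return queue_len * req.memory + alpha * req.waiting_time


def sal_load(server: Server, req: Request, beta: float, max_tokens_per_batch: int) -> float:
    """Formula 2 as stated: the larger of the memory term and the queued-token term."""
    memory_term = beta * (req.memory - server.free_mem)
    token_term = server.queued_tokens / max_tokens_per_batch
    return max(memory_term, token_term)


def route(servers: list[Server], req: Request, beta: float, max_tokens_per_batch: int) -> Server:
    """Pick the server with the lowest load, as the summary describes for SAL."""
    return min(servers, key=lambda s: sal_load(s, req, beta, max_tokens_per_batch))


# Illustrative numbers only; they do not come from the paper.
req = Request(memory=4.0, waiting_time=10.0)
servers = [
    Server("a", free_mem=2.0, queued_tokens=1024),
    Server("b", free_mem=8.0, queued_tokens=4096),
]
chosen = route(servers, req, beta=2.0, max_tokens_per_batch=2048)
print(chosen.name)                                # b
print(larry_score(req, queue_len=3, alpha=0.5))   # 17.0
```

In this toy example server `a` has the shorter token queue but too little free memory for the request, so its memory term (4.0) dominates and the request goes to server `b`, whose load is set by its token queue (2.0).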