Efficient LLM Scheduling by Learning to Rank

Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
2024-08-28
Abstract: In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the scheduling problem in large language model (LLM) inference serving. The authors point out that most current LLM serving systems adopt a simple First-Come-First-Serve (FCFS) scheduling strategy, which causes Head-of-Line (HOL) blocking and thereby reduces throughput and service quality. The paper proposes a scheduling method that uses learning to rank to predict the relative order of request generation lengths, so that the scheduler can better approximate Shortest Job First (SJF) or Shortest Remaining Time First (SRTF) scheduling and improve system performance.

### Main Contributions

1. **Importance of Relative Order**: The authors show that the relative order of generation lengths matters more for scheduling than their exact values, and that this order alone can guide LLM serving effectively.
2. **Application of Kendall’s Tau**: Kendall’s Tau is used as a metric for the similarity between a predicted schedule and the ideal SJF/SRTF schedule; a higher Kendall’s Tau usually corresponds to lower latency and higher throughput.
3. **Optimization Based on Learning to Rank**: A small auxiliary model (e.g., OPT-125M) is trained to predict the order of request generation lengths, which enables real-time scheduling with low overhead.
4. **Performance Improvement**: Integrating the method into a state-of-the-art LLM serving system significantly improves performance, with 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation.

### Method Overview

1. **Problem Definition**: For a given batch of requests, define the ground-truth generation lengths \( l \) and derive the ranking list \( r \) from them.
2. **Prediction Model**: Use a small OPT model as the predictor \( P \), mapping its hidden states to floating-point scores through an additional linear layer.
3. **Training Data**: Generate complete outputs using the target LLM, record the generation lengths, and convert them into ranking labels.
4. **Loss Function**: Optimize with the ListMLE loss, a listwise loss that takes the ordering of the entire list into account rather than individual items or pairs.
5. **Scheduling Algorithm**: Use a simple scheduling algorithm that orders requests by their predicted generation-length ranking, with an added mechanism to prevent request starvation.

### Experimental Results

1. **Chatbot Service**: On the ShareGPT and LMSYS-Chat-1M datasets, under a load of 64 requests/second, the proposed ranking method reduces average latency by 6.9 times compared to FCFS and by 1.5-1.9 times compared to the PO method.
2. **Synthetic Data Generation**: The proposed ranking method significantly outperforms other methods on two measures: it improves the time to generate 1000 samples by 2.4-6.5 times and the number of samples generated within 5 minutes by 3.2 times.

### Conclusion

The paper proposes a learning-to-rank-based LLM scheduling method that predicts the relative order of request generation lengths and uses it to schedule requests, which reduces latency and increases throughput. The method is simple and easy to integrate into existing production systems.
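To make the Kendall’s Tau metric from the contributions concrete, the following is a minimal pure-Python sketch of the tau-a variant (ties count as neither concordant nor discordant). The function name, the example lengths, and the convention that a higher predicted score means a longer predicted output are assumptions for illustration, not taken from the paper's code.

```python
from itertools import combinations


def kendall_tau(predicted_scores, true_lengths):
    """Kendall's Tau-a between predicted scores and true generation lengths.

    Returns 1.0 when both sequences order every pair the same way and
    -1.0 when every pair is ordered the opposite way.
    """
    if len(predicted_scores) != len(true_lengths):
        raise ValueError("sequences must have the same length")
    n = len(true_lengths)
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether the pair is ordered alike.
        product = (predicted_scores[i] - predicted_scores[j]) * (
            true_lengths[i] - true_lengths[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical generation lengths (in tokens) for a batch of four requests.
true_lengths = [12, 350, 48, 900]

perfect = kendall_tau([0.1, 0.7, 0.3, 0.9], true_lengths)    # same order
inverted = kendall_tau([0.9, 0.3, 0.7, 0.1], true_lengths)   # opposite order
one_swap = kendall_tau([0.1, 0.3, 0.7, 0.9], true_lengths)   # one pair swapped
print(perfect, inverted, one_swap)
```

Only the ordering of the scores matters here: any monotonic rescaling of the predictions leaves the value unchanged, which is why a predictor can be useful for scheduling without estimating exact lengths.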
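The ListMLE loss named in the method overview can be sketched as the negative log-likelihood of the ground-truth permutation under a Plackett-Luce model. The version below is plain Python for readability rather than a training-ready implementation; the function name and the convention that the shortest request is ranked first and should receive the highest score are assumptions made for this example.

```python
import math


def list_mle_loss(scores, true_lengths):
    """ListMLE loss for one list of requests.

    scores:       predicted scores, one per request
    true_lengths: ground-truth generation lengths, one per request
    """
    if len(scores) != len(true_lengths):
        raise ValueError("scores and true_lengths must have the same length")

    # Ground-truth permutation: shortest request first (assumed convention).
    order = sorted(range(len(scores)), key=lambda i: true_lengths[i])
    ordered = [scores[i] for i in order]

    loss = 0.0
    for i in range(len(ordered)):
        tail = ordered[i:]
        # Numerically stable log-sum-exp over the items not yet placed.
        m = max(tail)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in tail))
        # -log P(item i is chosen next among the remaining items)
        loss += log_sum_exp - ordered[i]
    return loss


lengths = [12, 350, 48]
good = list_mle_loss([3.0, 0.0, 1.5], lengths)   # scores agree with the true order
bad = list_mle_loss([0.0, 3.0, 1.5], lengths)    # scores contradict the true order
uniform = list_mle_loss([0.0, 0.0, 0.0], lengths)
print(good < bad, uniform)
```

With identical scores every permutation is equally likely, so the loss equals log(n!), which for three items is log 6; scores that agree with the true order push the loss toward zero.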
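The scheduling step described in the method overview, ranking requests by predicted length while guarding against starvation, might look roughly like the sketch below. This summary does not spell out the paper's exact starvation-prevention mechanism, so the threshold rule, the `Request` fields, and the `select_batch` function are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    predicted_score: float  # lower = predicted shorter output (assumed convention)
    wait_steps: int = 0     # consecutive scheduling steps this request was skipped


def select_batch(waiting, batch_size, starvation_threshold):
    """Pick the next batch: starved requests first, then shortest predicted."""

    def priority(req):
        starved = req.wait_steps >= starvation_threshold
        # Tuples sort element-wise, so starved requests (0) precede the rest (1).
        return (0 if starved else 1, req.predicted_score)

    ranked = sorted(waiting, key=priority)
    batch = ranked[:batch_size]
    for req in ranked[batch_size:]:
        req.wait_steps += 1  # skipped again this step
    for req in batch:
        req.wait_steps = 0   # scheduled, so reset the counter
    return batch


# One long request competes with a steady stream of short arrivals.
waiting = [Request("long", 9.0), Request("short_a", 1.0)]
served = []
for new_id in ["short_b", "short_c", "short_d"]:
    batch = select_batch(waiting, batch_size=1, starvation_threshold=2)
    served.extend(req.request_id for req in batch)
    waiting = [req for req in waiting if req not in batch]
    waiting.append(Request(new_id, 1.0))
print(served)
```

Without the threshold, the long request would wait for as long as shorter requests keep arriving; with it, the request is promoted after being skipped twice and runs in the third step.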