Abstract: In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the request scheduling problem in large language model (LLM) inference serving. The authors point out that most current LLM serving systems adopt a simple First-Come-First-Serve (FCFS) scheduling strategy, which leads to Head-of-Line (HOL) blocking and therefore reduces throughput and service quality. The paper proposes a new scheduling method that uses learning-to-rank techniques to predict the relative order of request generation lengths. This lets the scheduler better approximate Shortest Job First (SJF) or Shortest Remaining Time First (SRTF) scheduling and improves system performance.

### Main Contributions

1. **Importance of Relative Order**: The authors show that, for scheduling, knowing the relative order of generation lengths matters more than predicting the exact lengths, and that this order is enough to optimize the scheduling of LLM serving.
2. **Application of Kendall’s Tau**: The authors use Kendall’s Tau to measure the similarity between a predicted schedule and the ideal SJF/SRTF schedule. A higher Kendall’s Tau usually corresponds to lower latency and higher throughput.
3. **Optimization Based on Learning to Rank**: A small auxiliary model (e.g., OPT-125M) is trained to predict the order of request generation lengths, which enables low-overhead real-time scheduling.
4. **Performance Improvement**: Integrating this method into a state-of-the-art LLM serving system significantly improves performance: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation.

### Method Overview

1. **Problem Definition**: For a given batch of requests, let \( l \) denote the ground-truth generation lengths and derive the ranking list \( r \) from it.
2. **Prediction Model**: Use a small OPT model as the predictor \( P \), mapping hidden states to floating-point scores through an additional linear layer.
3. **Training Data**: Generate complete outputs using the target LLM, record the generation lengths, and convert them into ranking labels.
4. **Loss Function**: Optimize using the ListMLE loss function, which scores the ordering of the entire list rather than individual items or pairs.
5. **Scheduling Algorithm**: Design a simple scheduling algorithm that orders requests by their predicted generation length rankings, with a mechanism to prevent request starvation.

### Experimental Results

1. **Chatbot Service**: On the ShareGPT and LMSYS-Chat-1M datasets, under a load of 64 requests/second, the proposed ranking method reduces average latency by 6.9 times compared to FCFS and by 1.5-1.9 times compared to the PO method.
2. **Synthetic Data Generation**: The proposed ranking method significantly outperforms other methods on two measures: it increases throughput by 2.4-6.5 times when measured by the time to generate 1000 samples, and by 3.2 times when measured by the number of samples generated within 5 minutes.

### Conclusion

The paper proposes a learning-to-rank-based LLM scheduling method. By predicting the relative order of request generation lengths, it reduces latency and increases throughput, significantly improving system performance. The method is simple and easy to integrate into existing production systems.
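The pieces named in the Method Overview (Kendall’s Tau, the ListMLE loss, and rank-based scheduling with starvation prevention) can be illustrated with a minimal Python sketch. This is not the authors' implementation. All names here (`kendall_tau`, `list_mle_loss`, `Request`, `pick_batch`, `starvation_threshold`) are hypothetical, the score convention (higher score means a longer predicted output) is an assumption, the tau variant is the simple tau-a form, and the starvation rule (promote any request that has waited a fixed number of scheduling steps) is a simplified stand-in for the mechanism the paper describes.

```python
import math
from dataclasses import dataclass
from itertools import combinations


def kendall_tau(pred_scores, true_lengths):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(pred_scores)
    if n < 2 or n != len(true_lengths):
        raise ValueError("need two equally long sequences with >= 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (pred_scores[i] - pred_scores[j]) * (true_lengths[i] - true_lengths[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1  # tied pairs count as neither
    return (concordant - discordant) / (n * (n - 1) // 2)


def list_mle_loss(scores, true_lengths):
    """ListMLE: negative log-likelihood of the true ordering under a
    Plackett-Luce model. Assumed convention: longest request ranks first,
    so a good predictor gives longer requests higher scores."""
    order = sorted(range(len(scores)), key=lambda i: true_lengths[i], reverse=True)
    loss = 0.0
    for pos, idx in enumerate(order):
        tail = [scores[i] for i in order[pos:]]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(t - m) for t in tail))
        loss += log_sum - scores[idx]
    return loss


@dataclass
class Request:
    rid: str
    score: float      # predicted relative length (assumed: higher = longer)
    waited: int = 0   # consecutive scheduling steps spent unscheduled


def pick_batch(queue, batch_size, starvation_threshold):
    """Pick the next batch: starved requests first, then lowest score first."""
    ranked = sorted(queue, key=lambda r: (r.waited < starvation_threshold, r.score))
    batch = ranked[:batch_size]
    chosen = {r.rid for r in batch}
    for r in queue:
        r.waited = 0 if r.rid in chosen else r.waited + 1
    return batch


if __name__ == "__main__":
    lengths = [120, 15, 60]
    scores = [2.1, -0.5, 0.9]  # hypothetical predictor outputs
    print("tau:", kendall_tau(scores, lengths))
    print("ListMLE loss:", list_mle_loss(scores, lengths))
    queue = [Request("a", 2.1), Request("b", -0.5), Request("c", 0.9)]
    for step in range(4):
        print(step, [r.rid for r in pick_batch(queue, 1, starvation_threshold=2)])
```

In this sketch the shortest-predicted request keeps being scheduled until others have waited `starvation_threshold` steps, at which point they jump ahead regardless of score. Sorting by score alone would approximate SJF more closely but could delay long requests indefinitely under sustained load.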

Efficient LLM Scheduling by Learning to Rank

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Fast Inference for Augmented Large Language Models

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Llumnix: Dynamic Scheduling for Large Language Model Serving

Fairness in Serving Large Language Models

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Don't Stop Me Now: Embedding Based Scheduling for LLMs

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Fast Distributed Inference Serving for Large Language Models

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

UELLM: A Unified and Efficient Approach for LLM Inference Serving

LLMs can Schedule

SLoB: Suboptimal Load Balancing Scheduling in Local Heterogeneous GPU Clusters for Large Language Model Inference

One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Multi-Bin Batching for Increasing LLM Inference Throughput