Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction

Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, Sheng Zhang
2024-06-07
Abstract: Nowadays, large language models (LLMs) are published as a service and can be accessed by various applications via APIs, also known as language-model-as-a-service (LMaaS). Without knowing the generation length of requests, existing serving systems serve requests in a first-come, first-served (FCFS) manner with a fixed batch size, which leads to two problems that affect batch serving efficiency. First, the generation lengths of requests in a batch vary, and requests with short generation lengths must wait for requests with long generation lengths to finish during the batch serving procedure. Second, requests with longer generation lengths consume more memory during serving. Without knowing the generation lengths of batched requests, the batch size is always set small to avoid the out-of-memory (OOM) error, thus preventing the GPU from being fully utilized. In this paper, we find that a significant number of popular applications in the LMaaS scenario have a positive correlation between the generation length and the length of raw user input. Based on this observation, we propose Magnus, which can accurately predict the request generation length with the user input length, application-level, and user-level semantic features. Accordingly, Magnus can achieve high request throughput by batching requests of similar generation lengths together with adaptive batch sizes. Besides, Magnus can also schedule batches with the highest response ratio next (HRRN) policy to reduce request response time. Experiments conducted on our testbed show that Magnus improves request throughput by up to 234% and reduces response time by up to 89.7% compared to baselines.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper attempts to address the inefficiency of batch serving in the language-model-as-a-service (LMaaS) scenario, which arises because the generation length of a request is not known in advance. Specifically, existing serving systems use a fixed batch size and a first-come, first-served (FCFS) policy to handle requests, which leads to two main issues:

1. **Request Waiting**: In a batch, requests with short generation lengths must wait for requests with long generation lengths to complete before all of them are returned together. During this period, the completed requests still participate in computation, generating invalid tokens and wasting significant computation.
2. **Memory Consumption**: Requests with long generation lengths produce larger key-value caches and occupy more GPU memory. Because generation lengths cannot be predicted, existing systems typically use small batch sizes to avoid out-of-memory (OOM) errors, and so fail to fully utilize the parallel computing capabilities of GPUs.

To address these issues, the authors propose a system named Magnus, which improves batch serving efficiency by predicting the generation length of requests. It consists of four components:

- **Generation Length Predictor**: Predicts the generation length of each request with a random forest regressor, using the user input length, application-level semantic features, and user-level semantic features.
- **Adaptive Batcher**: Groups requests with similar predicted generation lengths and sets an appropriate batch size for each group to reduce computational waste.
- **Service Time Estimator**: Uses KNN regression to estimate the service time of each queued batch.
- **Batch Scheduler**: Selects the next batch to serve according to the highest response ratio next (HRRN) policy, reducing request queuing time and response time.

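
The effect of grouping requests by predicted generation length can be illustrated with a small sketch. This is not Magnus's batching algorithm: it is a simplified greedy stand-in, and the function names, the `token_budget` parameter, and the cost model (batch size times the longest predicted generation length, used as a rough proxy for key-value cache footprint) are all assumptions made for illustration. Predicted lengths are taken as given inputs rather than produced by a trained regressor.

```python
from typing import List


def wasted_tokens(batch: List[int]) -> int:
    """Invalid tokens produced when every request in the batch keeps
    running until the request with the longest generation finishes."""
    longest = max(batch)
    return sum(longest - g for g in batch)


def fcfs_batches(gen_lengths: List[int], batch_size: int) -> List[List[int]]:
    """Baseline: fixed batch size, requests taken in arrival order."""
    return [gen_lengths[i:i + batch_size]
            for i in range(0, len(gen_lengths), batch_size)]


def length_aware_batches(gen_lengths: List[int],
                         token_budget: int) -> List[List[int]]:
    """Hypothetical greedy batcher: sort by predicted length, then grow a
    batch while (batch size) * (longest length) stays within token_budget.
    Short requests therefore end up in larger batches than long ones."""
    batches: List[List[int]] = []
    current: List[int] = []
    for g in sorted(gen_lengths):
        # The list is sorted ascending, so g would be the longest member.
        if current and (len(current) + 1) * g > token_budget:
            batches.append(current)
            current = []
        current.append(g)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    predicted = [10, 200, 12, 190, 11, 210, 9, 205]  # made-up lengths
    fcfs = fcfs_batches(predicted, batch_size=4)
    aware = length_aware_batches(predicted, token_budget=800)
    print(sum(wasted_tokens(b) for b in fcfs))   # 793
    print(sum(wasted_tokens(b) for b in aware))  # 26
```

On this made-up workload, mixing short and long requests in FCFS batches wastes 793 token slots, while the length-aware grouping wastes 26 and gives the short requests a batch of four while the longest request runs alone.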
Experimental results show that, compared to existing continuous batching and model compression methods, Magnus improves request throughput by up to 234% and reduces response time by up to 89.7%.
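
The HRRN policy used by the batch scheduler can be sketched as follows. The standard response ratio is (waiting time + service time) / service time, so a batch's priority rises the longer it waits, and short batches rise fastest. The `QueuedBatch` class and its field names are hypothetical, and the estimated service times are supplied by hand here, whereas the summary above says Magnus obtains them from a KNN regressor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedBatch:
    name: str
    enqueue_time: float      # seconds; when the batch entered the queue
    est_service_time: float  # seconds; assumed to be given by an estimator


def response_ratio(batch: QueuedBatch, now: float) -> float:
    """HRRN priority: (waiting time + service time) / service time."""
    waiting = now - batch.enqueue_time
    return (waiting + batch.est_service_time) / batch.est_service_time


def pick_next(queue: List[QueuedBatch], now: float) -> QueuedBatch:
    """Return the queued batch with the highest response ratio."""
    return max(queue, key=lambda b: response_ratio(b, now))


if __name__ == "__main__":
    queue = [
        QueuedBatch("A", enqueue_time=0.0, est_service_time=20.0),
        QueuedBatch("B", enqueue_time=5.0, est_service_time=2.0),
        QueuedBatch("C", enqueue_time=8.0, est_service_time=10.0),
    ]
    # At t=10: A -> 1.5, B -> 3.5, C -> 1.2, so B is served first.
    print(pick_next(queue, now=10.0).name)  # B
```

Compared with shortest-job-first, the waiting term keeps long batches from starving: their ratio keeps growing until it exceeds that of newly arrived short batches.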