Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction

Ke Cheng,Wen Hu,Zhi Wang,Peng Du,Jianguo Li,Sheng Zhang

2024-06-07

Abstract: Nowadays, large language models (LLMs) are published as a service and can be accessed by various applications via APIs, also known as language-model-as-a-service (LMaaS). Without knowing the generation length of requests, existing serving systems serve requests in a first-come, first-served (FCFS) manner with a fixed batch size, which leads to two problems that affect batch serving efficiency. First, the generation lengths of requests in a batch vary, and requests with short generation lengths must wait for requests with long generation lengths to finish during the batch serving procedure. Second, requests with longer generation lengths consume more memory during serving. Without knowing the generation lengths of batched requests, the batch size is always set small to avoid the out-of-memory (OOM) error, thus preventing the GPU from being fully utilized. In this paper, we find that a significant number of popular applications in the LMaaS scenario have a positive correlation between the generation length and the length of raw user input. Based on this observation, we propose Magnus, which can accurately predict the request generation length with the user input length, application-level, and user-level semantic features. Accordingly, Magnus can achieve high request throughput by batching requests of similar generation lengths together with adaptive batch sizes. Besides, Magnus can also schedule batches with the highest response ratio next (HRRN) policy to reduce request response time. Experiments conducted on our testbed show that Magnus improves request throughput by up to 234% and reduces response time by up to 89.7% compared to baselines.

Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

This paper addresses the inefficiency of batch serving in the language-model-as-a-service (LMaaS) scenario, which arises because the generation length of a request is not known in advance. Specifically, existing serving systems handle requests in a first-come, first-served (FCFS) manner with a fixed batch size, which leads to two main issues:

1. **Request Waiting**: Within a batch, requests with short generation lengths must wait for requests with long generation lengths to complete before all of them are returned together. During this period, the already completed requests still participate in computation, generating invalid tokens and wasting a significant amount of computation.
2. **Memory Consumption**: Requests with long generation lengths produce larger key-value caches and therefore occupy more GPU memory. Since generation lengths cannot be predicted, existing systems typically use small batch sizes to avoid out-of-memory (OOM) errors, and thus fail to fully utilize the parallel computing capabilities of GPUs.

To address these issues, the authors propose a system named Magnus, which improves batch serving efficiency by predicting the generation length of each request. It consists of four components:

- **Generation Length Predictor**: Uses a random forest regressor to predict the generation length of a request from the user input length, application-level semantic features, and user-level semantic features.
- **Adaptive Batcher**: Groups requests with similar predicted generation lengths and sets an appropriate batch size for each group to reduce computational waste.
- **Service Time Estimator**: Uses KNN regression to estimate the service time of each queued batch.
- **Batch Scheduler**: Selects the next batch to serve according to the Highest Response Ratio Next (HRRN) policy to reduce request queuing time and response time.
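The HRRN policy named in the last component can be sketched in a few lines. This is a minimal illustration of the textbook HRRN rule, response ratio = (waiting time + service time) / service time, and not Magnus's actual implementation; the `QueuedBatch` structure, its field names, and the sample numbers are hypothetical, and the estimated service time is assumed to come from some estimator such as the KNN regressor mentioned above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class QueuedBatch:
    """A batch waiting in the queue (hypothetical structure for illustration)."""
    name: str
    arrival_time: float      # seconds, when the batch entered the queue
    est_service_time: float  # seconds, assumed to come from a service time estimator


def response_ratio(batch: QueuedBatch, now: float) -> float:
    """Textbook HRRN ratio: (waiting time + service time) / service time."""
    waiting = now - batch.arrival_time
    return (waiting + batch.est_service_time) / batch.est_service_time


def pick_next(queue: Sequence[QueuedBatch], now: float) -> QueuedBatch:
    """Return the queued batch with the highest response ratio."""
    return max(queue, key=lambda b: response_ratio(b, now))


# Hypothetical queue: one long-running batch and one short batch.
queue = [
    QueuedBatch("long", arrival_time=0.0, est_service_time=20.0),
    QueuedBatch("short", arrival_time=8.0, est_service_time=2.0),
]
chosen = pick_next(queue, now=10.0)
print(chosen.name)
```

The ratio favors short batches, which keeps response time low, but it also grows with waiting time, so a long batch that has queued long enough eventually outranks newly arrived short ones and is not starved.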
Experimental results show that, compared to existing continuous batching and model compression methods, Magnus improves request throughput by up to 234% and reduces response time by up to 89.7%.
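To make the request-waiting problem concrete, the following sketch counts the invalid token slots produced when every batch runs until its longest request finishes. It is a simplified illustration under stated assumptions, not the paper's method: the generation lengths are made-up numbers, the true lengths are assumed to be known (Magnus would rely on predicted lengths), the batch size is fixed rather than adaptive, and grouping is approximated by sorting on length.

```python
def wasted_tokens(gen_lengths, batch_size):
    """Count invalid token slots when each batch runs until its longest request ends."""
    wasted = 0
    for i in range(0, len(gen_lengths), batch_size):
        batch = gen_lengths[i:i + batch_size]
        longest = max(batch)
        # Every shorter request keeps occupying a slot until the longest one finishes.
        wasted += sum(longest - n for n in batch)
    return wasted


# Hypothetical generation lengths (in tokens), listed in arrival order.
arrivals = [10, 500, 12, 480, 15, 510, 9, 495]

# FCFS: batch requests in arrival order.
fcfs_waste = wasted_tokens(arrivals, batch_size=4)

# Length-aware: batch requests with similar generation lengths together.
grouped_waste = wasted_tokens(sorted(arrivals), batch_size=4)

print(fcfs_waste, grouped_waste)  # prints "2009 69"
```

With these made-up lengths, mixing short and long requests wastes 2009 token slots, while grouping similar lengths wastes 69, which is the effect the adaptive batcher aims for.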

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Multi-Bin Batching for Increasing LLM Inference Throughput

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism

Efficient LLM Scheduling by Learning to Rank

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Efficient Memory Management for Large Language Model Serving with PagedAttention

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

The Synergy of Speculative Decoding and Batching in Serving Large Language Models

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

Fast distributed inference serving for large language models

Length Controlled Generation for Black-box LLMs

LLMCad: Fast and Scalable On-device Large Language Model Inference

Optimizing Microservice Deployment in Edge Computing with Large Language Models: Integrating Retrieval Augmented Generation and Chain of Thought Techniques

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving