Abstract: Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g., local inference and communication. However, there is no comprehensive framework that provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments reveal that with 64 concurrent requests, ScaleLLM achieves a 4.3x speedup over vLLM and outperforms state-of-the-art systems with 1.5x higher throughput.

What problem does this paper attempt to address?

The paper attempts to address the issue of improving end-to-end efficiency in large language model (LLM) services, particularly in commercial applications. Specifically, the paper focuses on the following key issues:

1. **System Latency**: Existing research mainly focuses on optimizing local inference speed but neglects the end-to-end latency of the entire system. In practical applications, the latency of functional modules such as gateways and routing becomes the main bottleneck.
2. **High Concurrent Request Handling**: Commercial LLM applications need to handle a large number of concurrent requests, and the performance of existing solutions degrades significantly in high-concurrency scenarios.
3. **Resource Utilization Efficiency**: LLM inference tasks are computationally intensive and require a large amount of computing resources. Utilizing these resources efficiently to reduce costs is another important issue.
4. **Reliability and Security**: Commercial LLM services need fault tolerance, inference control mechanisms, and low-latency responses to ensure a good user experience.

To address these issues, the paper proposes an optimization framework called **ScaleLLM**, which aims to improve the end-to-end efficiency of LLM services through the following methods:

- **Optimizing the Inference Engine**: Reducing inference latency and increasing throughput through methods such as model parallelism, quantization techniques, and continuous batching.
- **Optimizing the Gateway**: Implementing the gateway in the high-performance Rust language and optimizing network I/O and CPU-intensive tasks to improve its ability to handle many concurrent requests.
- **Dynamic Load Balancing**: Designing a dynamic load balancing system that selects the appropriate resource configuration based on different levels of concurrent requests to ensure high performance under varying loads.

The paper validates the performance advantages of ScaleLLM under different concurrent request scenarios through detailed experiments, particularly highlighting that ScaleLLM significantly outperforms existing solutions in high-concurrency scenarios.
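
The dynamic load balancing idea summarized above, choosing a resource configuration according to the current level of concurrent requests, might be sketched as follows. This is a minimal illustration and not the ScaleLLM implementation: the `ResourceConfig` type, the tier thresholds (0, 16, 64), and the replica and tensor-parallel values are all hypothetical and chosen only to make the example concrete.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceConfig:
    """Hypothetical description of one serving backend configuration."""
    name: str
    replicas: int
    tensor_parallel: int


# Hypothetical tiers: (minimum number of in-flight requests, configuration).
# Thresholds and values are illustrative assumptions, not taken from the paper.
TIERS = [
    (0, ResourceConfig("low-load", replicas=1, tensor_parallel=4)),
    (16, ResourceConfig("medium-load", replicas=2, tensor_parallel=2)),
    (64, ResourceConfig("high-load", replicas=4, tensor_parallel=1)),
]
_THRESHOLDS = [threshold for threshold, _ in TIERS]


def select_config(concurrent_requests: int) -> ResourceConfig:
    """Return the configuration of the highest tier whose threshold is met."""
    if concurrent_requests < 0:
        raise ValueError("concurrent_requests must be non-negative")
    # bisect_right counts the thresholds <= concurrent_requests;
    # subtracting 1 gives the index of the matching tier.
    index = bisect_right(_THRESHOLDS, concurrent_requests) - 1
    return TIERS[index][1]


if __name__ == "__main__":
    for load in (1, 16, 64, 200):
        print(load, select_config(load).name)
```

Keeping the tiers in a sorted table means the routing decision is a single binary search per request, and the thresholds can be tuned from measurements without changing the selection logic.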

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Efficient LLM Scheduling by Learning to Rank

OptLLM: Optimal Assignment of Queries to Large Language Models

BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models

New Solutions on LLM Acceleration, Optimization, and Application

Efficient and Economic Large Language Model Inference with Attention Offloading

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Llumnix: Dynamic Scheduling for Large Language Model Serving

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism

UELLM: A Unified and Efficient Approach for LLM Inference Serving

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

Towards Pareto Optimal Throughput in Small Language Model Serving

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Efficient Memory Management for Large Language Model Serving with PagedAttention

Search for Efficient Large Language Models