ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He
2024-09-11
Abstract: Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g., local inference and communication; however, there is no comprehensive framework that provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments reveal that with 64 concurrent requests, ScaleLLM achieves a 4.3x speedup over vLLM and outperforms state-of-the-art systems with 1.5x higher throughput.
Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the issue of improving end-to-end efficiency in large language model (LLM) services, particularly in commercial applications. Specifically, the paper focuses on the following key issues:

1. **System Latency**: Existing research mainly focuses on optimizing local inference speed but neglects the end-to-end latency of the entire system. In practical applications, the latency of functional modules such as gateways and routing becomes the main bottleneck.
2. **High Concurrent Request Handling**: Commercial LLM applications need to handle a large number of concurrent requests, and the performance of existing solutions degrades significantly under high-concurrency scenarios.
3. **Resource Utilization Efficiency**: LLM inference tasks are computationally intensive and require a large amount of computing resources. Efficiently utilizing these resources to reduce costs is another important issue.
4. **Reliability and Security**: Commercial LLM services need fault tolerance capabilities, inference control mechanisms, and low-latency responses to ensure a good user experience.

To address these issues, the paper proposes an optimization framework called **ScaleLLM**, which aims to improve the end-to-end efficiency of LLM services through the following methods:

- **Optimizing the Inference Engine**: Reducing inference latency and increasing throughput through methods such as model parallelism, quantization techniques, and continuous batching.
- **Optimizing the Gateway**: Implementing the gateway in Rust, a high-performance language, and optimizing network I/O and CPU-intensive tasks to improve the ability to handle high concurrent requests.
- **Dynamic Load Balancing**: Designing a dynamic load balancing system that selects the appropriate resource configuration based on different levels of concurrent requests to ensure high performance under varying loads.
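
The dynamic load balancing idea described above, choosing a resource configuration according to the current level of concurrent requests, can be illustrated with a minimal sketch. This is not ScaleLLM's implementation: the `Tier` class, the tier names, and the concurrency thresholds are all hypothetical and chosen only to show the selection logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    """A hypothetical serving configuration targeted at a range of loads."""
    name: str
    max_concurrency: int  # highest in-flight request count this tier targets


# Illustrative tiers only: names and thresholds are assumptions,
# not values taken from the paper.
TIERS: List[Tier] = [
    Tier("low-latency-replica", 8),
    Tier("balanced-replica", 32),
    Tier("high-throughput-replica", 128),
]


def select_tier(in_flight: int, tiers: List[Tier] = TIERS) -> Tier:
    """Return the smallest tier whose concurrency ceiling covers the load."""
    if in_flight < 0:
        raise ValueError("in_flight must be non-negative")
    for tier in sorted(tiers, key=lambda t: t.max_concurrency):
        if in_flight <= tier.max_concurrency:
            return tier
    # Load exceeds every ceiling: fall back to the largest tier.
    return max(tiers, key=lambda t: t.max_concurrency)


if __name__ == "__main__":
    for load in (1, 16, 64, 500):
        print(load, "->", select_tier(load).name)
```

A real router would likely also weigh replica health, queue depth, and measured latency; the threshold lookup here only demonstrates the idea of mapping a concurrency level to a configuration.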
The paper validates the performance advantages of ScaleLLM under different concurrent request scenarios through detailed experiments, highlighting in particular that ScaleLLM significantly outperforms existing solutions in high-concurrency scenarios.