Learned Best-Effort LLM Serving

Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer
2024-07-15
Abstract: Many applications must provide low-latency LLM service to users or risk unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Our best-effort system can maintain availability with over 10x higher client request rates, serves above 96% of peak performance 4.1x more often, and serves above 98% of peak performance 2.3x more often than static serving on unpredictable workloads. Our learned router is robust to shifts in both the arrival and task distribution. Compared to static serving, learned best-effort serving allows for cost-efficient serving through increased hardware utility. Additionally, we argue that learned best-effort LLM serving is applicable in a wide variety of settings and gives application developers great flexibility to meet their specific needs.
Machine Learning; Artificial Intelligence; Computation and Language; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The problem this paper addresses is how to provide users with low-latency large language model (LLM) service without over-provisioning hardware. Many applications must respond quickly or risk an unacceptable user experience, yet provisioning enough resources to absorb fluctuating request patterns is often prohibitively expensive. The authors therefore propose a best-effort serving system based on deep reinforcement learning that adjusts service quality according to the task distribution and system load.

### Specific description of the problem

1. **Low-latency requirement**: Many applications must guarantee low-latency service, because slow responses lead to a poor user experience.
2. **Over-provisioning and cost**: Handling sudden request peaks by adding GPUs to run more model replicas in parallel is very expensive, especially for small enterprises and independent developers.
3. **Limitations of existing solutions**:
   - Using a smaller model reduces latency but significantly reduces quality.
   - Static resource allocation cannot respond flexibly to changes in request patterns.

### Solution proposed in the paper

The authors propose a best-effort serving framework based on deep reinforcement learning that routes each client request to one of several models of different sizes. Main features include:

- **Dynamic routing mechanism**: A router selects the most suitable model for each request according to the current task and system load.
- **Multiple models working together**: Models of different sizes are served together to balance accuracy against response time.
- **Deep reinforcement learning**: The routing policy is trained with the DQN algorithm to maximize cumulative performance.
- **High adaptability**: The system adapts to shifts in the task distribution and in load while maintaining high availability and performance.

### Main objectives

- **Improve performance**: On unpredictable workloads, best-effort serving stays above 96% of peak performance 4.1x more often, and above 98% of peak performance 2.3x more often, than static serving.
- **Improve availability**: Best-effort serving continues to meet client deadlines at client request rates more than 10x higher than static serving can sustain.
- **Improve cost-effectiveness**: Compared with a static serving system using twice as many GPUs, best-effort serving still exceeds 90% of peak performance more often, and performance per unit of hardware improves by 3.94x.

Through these improvements, the paper shows how to provide high-quality, low-latency LLM service with limited hardware resources.
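
To make the dynamic routing idea above concrete, here is a minimal Python sketch of a learned router. It is not the authors' implementation: it swaps the DQN for a tabular, single-step value learner with epsilon-greedy exploration, the state is reduced to a discretized queue depth (the task type is omitted), and it ignores the effect that a routing decision has on future load. The model pool, quality scores, service times, deadline, and the names `MODELS`, `reward`, `train`, and `route` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical model pool: (name, quality in [0, 1], service time in seconds).
# These values are illustrative assumptions, not measurements from the paper.
MODELS = [("small", 0.60, 0.2), ("medium", 0.80, 0.5), ("large", 1.00, 1.0)]
DEADLINE = 1.5      # assumed client deadline in seconds
LOAD_BUCKETS = 4    # queue depth is discretized into buckets 0..3


def reward(action, queue_depth):
    """Return the chosen model's quality if the deadline is met, else 0."""
    _, quality, service_time = MODELS[action]
    latency = service_time * (1 + queue_depth)  # toy queueing model (assumption)
    return quality if latency <= DEADLINE else 0.0


def train(episodes=20000, epsilon=0.2, lr=0.1, seed=0):
    """Learn a table q[load][action] from simulated requests."""
    rng = random.Random(seed)
    q = [[0.0] * len(MODELS) for _ in range(LOAD_BUCKETS)]
    for _ in range(episodes):
        load = rng.randrange(LOAD_BUCKETS)          # sample a system load
        if rng.random() < epsilon:                  # explore a random model
            action = rng.randrange(len(MODELS))
        else:                                       # exploit the current best
            action = max(range(len(MODELS)), key=lambda a: q[load][a])
        # Move the estimate toward the observed reward (single-step update).
        q[load][action] += lr * (reward(action, load) - q[load][action])
    return q


def route(q, queue_depth):
    """Pick the model with the highest learned value for this load."""
    load = min(queue_depth, LOAD_BUCKETS - 1)
    best = max(range(len(MODELS)), key=lambda a: q[load][a])
    return MODELS[best][0]


if __name__ == "__main__":
    table = train()
    for depth in range(LOAD_BUCKETS):
        print(depth, route(table, depth))
```

Under these toy assumptions the learned table sends requests to the largest model that still meets the deadline, falling back to smaller models as the queue grows. A DQN replaces the table with a neural network so the router can generalize over richer state (for example, task type and per-model queue state) and account for long-term reward.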