Abstract: Many applications must provide low-latency LLM service to users or risk unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Our best-effort system can maintain availability with over 10x higher client request rates, serves above 96% of peak performance 4.1x more often, and serves above 98% of peak performance 2.3x more often than static serving on unpredictable workloads. Our learned router is robust to shifts in both the arrival and task distribution. Compared to static serving, learned best-effort serving allows for cost-efficient serving through increased hardware utility. Additionally, we argue that learned best-effort LLM serving is applicable in a wide variety of settings and provides application developers great flexibility to meet their specific needs.

What problem does this paper attempt to address?

The paper asks how to provide users with low-latency large language model (LLM) service without increasing hardware costs. Many applications need low-latency LLM service to avoid an unacceptable user experience, but over-provisioning resources to absorb fluctuating request patterns is usually prohibitively expensive. The authors therefore propose a best-effort serving system based on deep reinforcement learning that adjusts service quality dynamically according to the task distribution and the system load.

### Specific description of the problem

1. **Low-latency requirement**: Many applications must keep response latency low, or the user experience suffers.
2. **Over-provisioning and cost**: Handling sudden request peaks by adding GPUs to run more model replicas in parallel is very expensive, especially for small enterprises and independent developers.
3. **Limitations of existing solutions**:
   - Using a smaller model reduces latency, but it can significantly reduce output quality.
   - Static resource allocation cannot respond flexibly to changes in request patterns.

### Solution proposed in the paper

The authors propose a best-effort serving framework based on deep reinforcement learning that dynamically selects among models of different sizes for each client request. Main features include:

- **Dynamic routing mechanism**: The router selects the most suitable model according to the current task and system load.
- **Multiple models working together**: Models of different sizes are served side by side to balance accuracy against response time.
- **Deep reinforcement learning**: The DQN algorithm is used to optimize the request routing policy and maximize cumulative performance.
- **High adaptability**: The system adapts to shifts in the task distribution and in load while maintaining high availability and performance.

### Main objectives

- **Improve performance**: On unpredictable workloads, best-effort serving stays above 96% of peak performance 4.1x more often, and above 98% of peak performance 2.3x more often, than static serving.
- **Improve availability**: Best-effort serving can meet client deadlines at request rates over 10x higher than static serving can sustain.
- **Improve cost-effectiveness**: Compared with a static serving system that uses twice as many GPUs, best-effort serving still exceeds 90% of peak performance more often, and performance per unit of hardware increases by 3.94x.

Through these improvements, the paper shows how to provide high-quality, low-latency LLM service with limited hardware resources.
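The routing idea described above can be illustrated with a minimal sketch, under stated assumptions. The summary says the paper uses DQN; this sketch swaps in tabular Q-learning (the non-neural precursor of DQN) so that it runs with the Python standard library alone. The model pool, quality scores, service times, deadline, arrival pattern, and the names `MODELS`, `step`, `train`, and `route` are all hypothetical. The state here is only the server backlog, whereas the paper's router also takes the task distribution into account.

```python
import random

# Hypothetical model pool: (name, quality score, service time in ticks).
# Illustrative assumptions only, not values taken from the paper.
MODELS = [("small", 0.70, 1), ("medium", 0.85, 2), ("large", 1.00, 4)]
DEADLINE = 6      # assumed per-request latency budget, in ticks
MAX_BACKLOG = 8   # backlog is capped so the Q-table stays small


def step(backlog, action, rng):
    """Serve one request with MODELS[action]; return (reward, next_backlog)."""
    _, quality, service = MODELS[action]
    latency = backlog + service            # wait for queued work, then run
    # Best-effort reward: the model's quality if the deadline is met, else 0.
    reward = quality if latency <= DEADLINE else 0.0
    gap = rng.choice([1, 1, 2, 4])         # assumed bursty inter-arrival time
    next_backlog = min(MAX_BACKLOG, max(0, latency - gap))
    return reward, next_backlog


def train(episodes=2000, steps=50, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning over (backlog, model choice) with epsilon-greedy."""
    rng = random.Random(seed)
    q = [[0.0] * len(MODELS) for _ in range(MAX_BACKLOG + 1)]
    for _ in range(episodes):
        backlog = 0
        for _ in range(steps):
            if rng.random() < eps:         # explore a random model
                action = rng.randrange(len(MODELS))
            else:                          # exploit the current estimate
                action = max(range(len(MODELS)), key=q[backlog].__getitem__)
            reward, nxt = step(backlog, action, rng)
            # Standard Q-learning update toward the bootstrapped target.
            target = reward + gamma * max(q[nxt])
            q[backlog][action] += alpha * (target - q[backlog][action])
            backlog = nxt
    return q


def route(q, backlog):
    """Return the name of the model the learned policy picks at this load."""
    state = min(MAX_BACKLOG, max(0, backlog))
    best = max(range(len(MODELS)), key=q[state].__getitem__)
    return MODELS[best][0]


if __name__ == "__main__":
    table = train()
    for load in range(MAX_BACKLOG + 1):
        print(f"backlog={load}: route to {route(table, load)}")
```

In this toy setting a static policy that always picks the large model would miss the deadline whenever the backlog exceeds two ticks, while the learned policy can fall back to a smaller model under load. A DQN replaces the table `q` with a neural network, which is what makes richer state (such as task features) practical.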

Learned Best-Effort LLM Serving
