Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu

DOI: https://doi.org/10.48550/arXiv.2405.06856

2024-05-11

Abstract: The demand for large language model (LLM) inference is gradually dominating the artificial intelligence workloads. Therefore, there is an urgent need for cost-efficient inference serving. Existing work focuses on single-worker optimization and lacks consideration of cluster-level management for both inference queries and computing resources. However, placing requests and managing resources without considering the query features easily causes SLO violations or resource underutilization. Providers are forced to allocate extra computing resources to guarantee user experience, leading to additional serving costs. In this paper we introduce Aladdin, a scheduler that co-adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts minimal computing resources and the corresponding serving workers' configuration required to fulfill the SLOs for all queries. Then, it places the queries to each serving worker according to the prefill and decode latency models of batched LLM inference to maximize each worker's utilization. Results show that Aladdin reduces the serving cost of a single model by up to 71% for the same SLO level compared with the baselines, which can be millions of dollars per year.

Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

This paper addresses the cost-efficiency of large language model (LLM) inference serving. Existing work mainly optimizes a single worker and does not manage inference queries and computing resources at the cluster level. When requests are placed and resources are managed without regard to query characteristics, the result is either service-level objective (SLO) violations or resource underutilization. Providers then have to allocate extra computing resources to protect the user experience, which raises serving costs. The paper proposes Aladdin, a scheduler that jointly and adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts the minimum computing resources, and the corresponding serving-worker configuration, needed to meet the SLOs of all queries. It then assigns queries to workers according to prefill and decode latency models of batched LLM inference, so as to maximize each worker's utilization. The reported results show that, compared with the baselines, Aladdin reduces the serving cost of a single model by up to 71% at the same SLO level, which the authors say can amount to millions of dollars per year. In short, the paper targets high serving cost, resource underutilization, and SLO violations in existing LLM inference serving through a new scheduling mechanism.
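To make the idea of SLO-aware placement concrete, here is a minimal sketch in Python. It is not Aladdin's algorithm: it swaps in a plain first-fit-decreasing bin-packing heuristic, and the linear latency models, their coefficients, and every name (`Query`, `Worker`, `prefill_latency`, `decode_step_latency`, `place_queries`) are assumptions made for illustration only. It also assumes output lengths are known in advance, which a real system would have to predict.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Query:
    input_len: int   # prompt tokens
    output_len: int  # output tokens, assumed known here (hypothetical)


@dataclass
class Worker:
    queries: List[Query] = field(default_factory=list)


def prefill_latency(input_tokens: int) -> float:
    """Hypothetical linear prefill model (seconds); coefficients are made up."""
    return 0.0002 * input_tokens + 0.02


def decode_step_latency(context_tokens: int) -> float:
    """Hypothetical linear per-step decode model (seconds); coefficients are made up."""
    return 0.00001 * context_tokens + 0.02


def fits(batch: List[Query], ttft_slo: float, step_slo: float) -> bool:
    """Check both latency constraints for a candidate batch on one worker."""
    input_tokens = sum(q.input_len for q in batch)
    # Worst-case context: every query has generated all of its output tokens.
    context_tokens = sum(q.input_len + q.output_len for q in batch)
    return (prefill_latency(input_tokens) <= ttft_slo
            and decode_step_latency(context_tokens) <= step_slo)


def place_queries(queries: List[Query], ttft_slo: float, step_slo: float) -> List[Worker]:
    """First-fit decreasing: open a new worker only when no existing one fits."""
    workers: List[Worker] = []
    by_size = sorted(queries, key=lambda q: q.input_len + q.output_len, reverse=True)
    for q in by_size:
        if not fits([q], ttft_slo, step_slo):
            raise ValueError(f"{q} cannot meet the SLOs even on an empty worker")
        for w in workers:
            if fits(w.queries + [q], ttft_slo, step_slo):
                w.queries.append(q)
                break
        else:
            workers.append(Worker([q]))
    return workers


if __name__ == "__main__":
    demo = [Query(1000, 500) for _ in range(4)]
    placed = place_queries(demo, ttft_slo=0.5, step_slo=0.06)
    print(len(placed), [len(w.queries) for w in placed])
```

The number of workers returned plays the role of the "predicted minimum" resource count, but only as a heuristic upper bound; first-fit decreasing does not guarantee an optimal packing, and the paper's own latency models and placement method may differ substantially.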

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Llumnix: Dynamic Scheduling for Large Language Model Serving

Efficient LLM Scheduling by Learning to Rank

SLoB: Suboptimal Load Balancing Scheduling in Local Heterogeneous GPU Clusters for Large Language Model Inference

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

A System for Microserving of LLMs

UELLM: A Unified and Efficient Approach for LLM Inference Serving

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving

Fast Inference for Augmented Large Language Models

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling

Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling

Fast Distributed Inference Serving for Large Language Models