Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu
DOI: https://doi.org/10.48550/arXiv.2405.06856
2024-05-11
Abstract: The demand for large language model (LLM) inference is gradually dominating the artificial intelligence workloads. Therefore, there is an urgent need for cost-efficient inference serving. Existing work focuses on single-worker optimization and lacks consideration of cluster-level management for both inference queries and computing resources. However, placing requests and managing resources without considering the query features easily causes SLO violations or resource underutilization. Providers are forced to allocate extra computing resources to guarantee user experience, leading to additional serving costs. In this paper we introduce Aladdin, a scheduler that co-adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts minimal computing resources and the corresponding serving workers' configuration required to fulfill the SLOs for all queries. Then, it places the queries to each serving worker according to the prefill and decode latency models of batched LLM inference to maximize each worker's utilization. Results show that Aladdin reduces the serving cost of a single model by up to 71% for the same SLO level compared with the baselines, which can be millions of dollars per year.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the cost-efficiency of large language model (LLM) inference serving. Existing work mainly optimizes a single worker and lacks cluster-level management of both inference queries and computing resources. Placing requests and managing resources without considering query characteristics easily leads to violations of service-level objectives (SLOs) or to resource underutilization, which forces providers to allocate extra computing resources to protect the user experience and so raises serving cost. The paper proposes a new scheduler, Aladdin, that co-adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts the minimal computing resources, and the corresponding serving-worker configuration, required to meet the SLOs of all queries. It then assigns queries to serving workers according to the prefill and decode latency models of batched LLM inference, so as to maximize each worker's utilization. Experimental results show that, compared with the baselines, Aladdin reduces the serving cost of a single model by up to 71% at the same SLO level, which can amount to millions of dollars per year. In summary, the paper targets high cost, resource underutilization, and SLO violations in existing LLM inference serving by introducing a new scheduling mechanism.
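To make the idea of latency-model-driven placement concrete, here is a minimal sketch in Python. It is not Aladdin's algorithm, which this summary does not describe in detail. It swaps in a plain greedy first-fit heuristic, and everything in it is an assumption for illustration: the linear prefill and decode latency models, their coefficients, the two SLO thresholds (time to first token and per-token decode time), and the names `Query`, `Worker`, `fits`, and `place`. Output lengths are assumed to be known in advance, which a real system would have to predict.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    input_len: int    # prompt tokens
    output_len: int   # output tokens (assumed known here; would be predicted in practice)


@dataclass
class Worker:
    queries: list = field(default_factory=list)


# Hypothetical latency-model coefficients and SLOs (illustrative values only).
PREFILL_MS_PER_TOKEN = 0.5
DECODE_BASE_MS = 20.0
DECODE_MS_PER_CTX_TOKEN = 0.01
FIRST_TOKEN_SLO_MS = 1000.0
PER_TOKEN_SLO_MS = 50.0


def prefill_latency_ms(batch):
    # Assumed model: prefill time grows linearly with total prompt tokens in the batch.
    return PREFILL_MS_PER_TOKEN * sum(q.input_len for q in batch)


def decode_latency_ms(batch):
    # Assumed model: per-iteration decode time grows with total context held by the batch.
    ctx = sum(q.input_len + q.output_len for q in batch)
    return DECODE_BASE_MS + DECODE_MS_PER_CTX_TOKEN * ctx


def fits(worker, query):
    # A query fits if the enlarged batch still meets both SLOs under the models above.
    batch = worker.queries + [query]
    return (prefill_latency_ms(batch) <= FIRST_TOKEN_SLO_MS
            and decode_latency_ms(batch) <= PER_TOKEN_SLO_MS)


def place(queries):
    # Greedy first-fit, largest queries first; open a new worker when none fits.
    # A query too large to meet the SLOs even alone still gets its own worker here.
    workers = []
    for q in sorted(queries, key=lambda q: q.input_len + q.output_len, reverse=True):
        for w in workers:
            if fits(w, q):
                w.queries.append(q)
                break
        else:
            workers.append(Worker([q]))
    return workers


if __name__ == "__main__":
    demo = [Query(800, 400) for _ in range(4)]
    print(len(place(demo)), "workers needed")
```

In this sketch the number of workers falls out of the placement itself: the scheduler adds a worker only when no existing one can absorb a query without breaking an SLO, which is one simple way to couple placement and scaling.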