Abstract: A service-level objective (SLO) is a target performance metric of service that cloud vendors aim to ensure. Delivering optimized SLOs can enhance user satisfaction and improve the competitiveness of cloud vendors. As large language models (LLMs) are gaining increasing popularity across various fields, it is of great significance to optimize SLOs for LLM inference services. In this paper, we observe that adjusting the parameters of LLM inference engines can improve service performance, and the optimal parameter configurations of different services are different. Therefore, we propose SCOOT, an automatic performance tuning system to optimize SLOs for each LLM inference service by tuning the parameters of the inference engine. We first propose a generalized formulation of the tuning problem to handle various objectives and constraints between parameters, and SCOOT exploits the Bayesian optimization (BO) technique to resolve the problem via exploration and exploitation. Moreover, SCOOT adopts a random forest to learn hidden constraints during the tuning process to mitigate invalid exploration. To improve the tuning efficiency, SCOOT utilizes the parallel suggestion to accelerate the tuning process. Extensive experiments demonstrate that SCOOT can significantly outperform existing tuning techniques in SLO optimization while greatly improving the tuning efficiency.

What problem does this paper attempt to address?

The paper primarily addresses the issue of optimizing the service performance of large language models (LLMs), particularly by adjusting the parameters of LLM inference engines to optimize service level objectives (SLOs). Below is a summary of the core issues the paper attempts to solve:

### Research Background and Motivation

- **LLM Inference Engine and Parameters**: The paper mentions that the inference engine integrates advanced technologies such as continuous batching and paged attention, and exposes many adjustable parameters that can change request scheduling strategies or memory allocation strategies.
- **Service Level Objectives (SLOs) and Stress Testing**: To ensure service quality, customers usually agree on SLOs with cloud providers. If SLOs are violated, providers not only have to compensate but may also suffer reputational damage. Cloud providers typically determine SLOs through stress testing.

### Experimental Observations

- **Importance of Parameter Adjustment**: Experiments show that adjusting the parameter configuration of the inference engine can significantly improve service performance. For example, time to first token (TTFT) and time per output token (TPOT) can be reduced by up to 98.9% and 49.9%, respectively.
- **Differences in Optimal Configurations for Different Services**: There is no "best practice" configuration that applies to all scenarios; the optimal parameter configuration varies for different services.

### Challenges

- **Diverse Optimization Goals**: Different application requirements determine that customers want to optimize different performance metrics. For example, non-interactive applications may focus more on throughput, while interactive applications tend to minimize both TTFT and TPOT.
- **Complex Known and Hidden Constraints**: There are dependencies between parameters, forming known constraints. Additionally, in certain specific services, some parameter combinations can cause the inference engine to crash, known as hidden constraints.
- **High Evaluation Costs**: Each evaluation (i.e., stress test) takes several minutes. For a large number of requests or larger-scale LLMs, the required time will be longer.

### Contributions of the Paper

The paper proposes a system named SCOOT (ServiCe-level Objective Oriented performance Tuning system) aimed at automatically adjusting the parameters of the LLM inference engine to optimize SLOs.

- **General Problem Formulation**: The paper provides a problem formulation method that supports multiple optimization goals and complex constraints.
- **Bayesian Optimization**: Utilizes single-objective Bayesian optimization (SOBO) and multi-objective Bayesian optimization (MOBO) to search for the optimal parameter configuration.
- **Random Forest Regression**: Used to learn hidden constraints and avoid invalid exploration.
- **Parallel Suggestions**: Improves tuning efficiency by fully utilizing idle computing resources.
- **Experimental Validation**: Extensive experiments validate the superiority of SCOOT in optimizing LLM inference engines, including performance across different LLMs, computing resources, and request patterns.

In summary, the main purpose of this paper is to address the problem of optimizing SLOs by automatically adjusting the parameters of the LLM inference engine, and it proposes a series of methods and techniques to tackle this challenge.
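To make the multi-objective case and the known constraints mentioned above more concrete, the following is a minimal Python sketch, not SCOOT's implementation. It filters candidate configurations with a known constraint and then keeps only the Pareto-optimal ones when both TTFT and TPOT are minimized. The parameter names `max_seqs` and `max_batched_tokens`, the dependency between them, the `fake_stress_test` helper, and all latency numbers are hypothetical and chosen purely for illustration; they do not come from the paper.

```python
from typing import Dict, List, Tuple

Config = Dict[str, int]
Result = Tuple[Config, float, float]  # (config, ttft_ms, tpot_ms)


def satisfies_known_constraints(cfg: Config) -> bool:
    """Hypothetical known constraint: the per-batch token budget must
    allow at least one token for every concurrently scheduled sequence."""
    return cfg["max_batched_tokens"] >= cfg["max_seqs"]


def dominates(a: Result, b: Result) -> bool:
    """a dominates b if it is no worse on both metrics and strictly
    better on at least one (both metrics are minimized)."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Made-up latencies keyed by max_seqs; a real tuner would run a stress test.
FAKE_LATENCIES = {64: (120.0, 30.0), 128: (200.0, 25.0), 256: (260.0, 32.0)}


def fake_stress_test(cfg: Config) -> Result:
    """Stand-in for a stress test that takes minutes in practice."""
    ttft, tpot = FAKE_LATENCIES[cfg["max_seqs"]]
    return (cfg, ttft, tpot)


candidates: List[Config] = [
    {"max_seqs": 64, "max_batched_tokens": 2048},
    {"max_seqs": 128, "max_batched_tokens": 4096},
    {"max_seqs": 256, "max_batched_tokens": 8192},
    {"max_seqs": 256, "max_batched_tokens": 128},  # violates the constraint
]

# Reject invalid configurations before spending any evaluation budget.
valid = [c for c in candidates if satisfies_known_constraints(c)]
results = [fake_stress_test(c) for c in valid]
front = pareto_front(results)

for cfg, ttft, tpot in front:
    print(cfg, ttft, tpot)
```

In this toy data the configuration with `max_seqs=256` is worse on both metrics than the other two valid configurations, so it is dropped, while the remaining two represent a genuine trade-off between TTFT and TPOT. Hidden constraints cannot be checked up front like this, which is why the paper learns them from observed crashes during tuning.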

Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning

Revisiting SLO and Goodput Metrics in LLM Serving

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

SIBO: A Simple Booster for Parameter-Efficient Fine-Tuning

FastTuning: Enabling Fast and Efficient Hyper-Parameter Tuning with Partitioning and Parallelism of Search Space

LLaMoCo: Instruction Tuning of Large Language Models for Optimization Code Generation

LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning

Solving General Natural-Language-Description Optimization Problems with Large Language Models

EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving

Sequential Large Language Model-Based Hyper-Parameter Optimization

A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

OptLLM: Optimal Assignment of Queries to Large Language Models

Towards Pareto Optimal Throughput in Small Language Model Serving

Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators

Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference