Abstract: Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in TetriInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that TetriInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% fewer resources all the while lowering average TTFT and average JCT by 97% and 47%, respectively.

What problem does this paper attempt to address?

This paper addresses the interference that arises when different types of inference requests (prefill and decode) run together in large language model (LLM) inference serving. The paper points out that current LLM deployment practices usually overlook the distinct characteristics of the prefill and decode phases, which causes severe performance degradation under mixed downstream workloads. For example:

1. **Prefill vs. Prefill**: Running multiple prefill requests simultaneously degrades performance severely, and the degradation becomes more pronounced as the number of prefill requests grows.
2. **Prefill vs. Decode**: Prefill and decode requests that run simultaneously interfere with each other, degrading the performance of both.
3. **Decode vs. Decode**: Mixing decode requests of different lengths lowers throughput because the system cannot manage memory bandwidth and capacity effectively, which causes contention and head-of-line blocking.

To solve these problems, the paper proposes a system named TetriInfer, which optimizes LLM inference serving through three main strategies:

1. **Fixed-size prefill chunks**: Divide the input prompt into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit, thereby avoiding interference in the prefill phase.
2. **Separate prefill and decode instances**: Disaggregate the prefill and decode phases so that each can run independently, reducing mutual interference.
3. **Intelligent two-level scheduling algorithm**: Use a two-level scheduling algorithm augmented with predicted resource usage to avoid scheduling hotspots for decode requests.

These strategies aim to improve time-to-first-token (TTFT), job completion time (JCT), and performance per dollar (perf/$) when handling mixed workloads. According to the abstract, TetriInfer uses 38% fewer resources while lowering average TTFT by 97% and average JCT by 47%.
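
As a rough illustration of the first and third strategies, the sketch below packs prompts into fixed-size chunks and places decode requests by predicted load. It is not TetriInfer's implementation: the names `chunk_prefill` and `pick_decode_instance`, the 512-token chunk size, the choice to pack pieces of several prompts into one chunk, and the "least predicted load" placement rule are all assumptions made for illustration only.

```python
CHUNK_SIZE = 512  # hypothetical chunk size in tokens, not a value from the paper


def chunk_prefill(prompt_lens, chunk_size=CHUNK_SIZE):
    """Pack prompts into fixed-size chunks, splitting long prompts as needed.

    Returns a list of chunks. Each chunk is a list of (request_id, n_tokens)
    pieces whose token counts sum to chunk_size; only the last chunk may be
    shorter. Packing several prompts into one chunk is an assumption here.
    """
    chunks, current, room = [], [], chunk_size
    for req_id, length in enumerate(prompt_lens):
        remaining = length
        while remaining > 0:
            take = min(remaining, room)  # fill whatever space is left
            current.append((req_id, take))
            remaining -= take
            room -= take
            if room == 0:  # chunk is full, start a new one
                chunks.append(current)
                current, room = [], chunk_size
    if current:  # trailing, partially filled chunk
        chunks.append(current)
    return chunks


def pick_decode_instance(loads, predicted_tokens):
    """Place a request on the decode instance with the least predicted load.

    `loads` holds one predicted-token total per decode instance and is
    updated in place. This greedy rule is an illustrative stand-in for the
    paper's scheduler, not a description of it.
    """
    target = min(range(len(loads)), key=loads.__getitem__)
    loads[target] += predicted_tokens
    return target


if __name__ == "__main__":
    # Three prompts of 700, 100, and 300 tokens become three chunks.
    for chunk in chunk_prefill([700, 100, 300]):
        print(chunk)

    loads = [300, 120, 200]
    print(pick_decode_instance(loads, 150), loads)
```

Because every chunk except possibly the last carries the same number of tokens, each prefill step does a similar amount of work regardless of how long individual prompts are. The placement helper shows why a length prediction matters: without it, the scheduler could only balance by request count, and long-running decode requests could pile up on one instance.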

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

Efficient and Economic Large Language Model Inference with Attention Offloading

Distributed Inference Performance Optimization for LLMs on CPUs

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Inference Performance Optimization for Large Language Models on CPUs

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models

UELLM: A Unified and Efficient Approach for LLM Inference Serving

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Efficient LLM inference solution on Intel GPU

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Large Language Models (llms) Inference Offloading and Resource Allocation in Cloud-Edge Networks: an Active Inference Approach

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model

ISO: Overlap of Computation and Communication within Seqenence For LLM Inference

Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

InferCept: Efficient Intercept Support for Augmented Large Language Model Inference

Collaborative Inference for Large Models with Task Offloading and Early Exiting

A System for Microserving of LLMs