Abstract: Many LLM tasks are performed in large batches or even offline, and the performance indicator for them is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and are limited in supporting large batched tasks with prefix sharing. Existing solutions use an LRU-based cache to reuse the KV context of a common prefix. Under this implicit cache management, KV context that is about to be reused may be evicted prematurely. Even if it is not evicted, the lifetime of the shared KV context is extended because requests sharing the same context are not scheduled together, resulting in larger memory usage. These streaming-oriented systems schedule requests in first-come-first-serve or similar order. As a result, requests with a larger ratio of decoding steps may be scheduled too late to be mixed with prefill chunks to increase hardware utilization. In addition, batching based on token and request counts can limit the size of the token-batch, which keeps the GPU from saturating in iterations dominated by decoding tokens. We propose BatchLLM to address the above problems. BatchLLM explicitly identifies the common prefixes globally. Requests sharing the same prefix are scheduled together to reuse the KV context as fully as possible, which also shrinks the lifetime of the common KV memory. BatchLLM reorders the requests and schedules those with a larger ratio of decoding first, to better mix the decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps increase GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM by 1.1x to 2x on a set of microbenchmarks and two typical industry workloads.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching" aims to solve several key problems that existing large language model (LLM) inference engines face when handling large batched tasks:

1. **Lack of global prefix sharing**: Existing LLM inference engines mainly optimize for streaming online services and use an LRU cache to reuse the KV context of a prefix shared by different requests. When handling large batched tasks, this implicit cache management may evict KV context that is about to be reused, which causes unnecessary recomputation. In addition, because requests sharing the same prefix are not scheduled together, the lifetime of the shared KV context is extended, which increases memory usage.

2. **Sub-optimal token batching**: Current LLM inference engines usually schedule requests independently, in arrival order, and do not cluster requests with a common prefix. Besides the eviction and KV lifetime problems described above, the token-batch formed at each iteration is constrained by the arrival order of requests, which may make it sub-optimal. For example, the simple scheduling in Figure 1 cannot mix the decoding of request 3 with the prefill chunks of other requests. If request 3 were scheduled earlier, its decoding could be mixed with those prefill chunks. A further problem is that current systems use the number of tokens and the number of requests as batching thresholds. This limits how many tokens can be batched in decoding-dominated iterations and leaves the GPU underutilized even when memory is sufficient.

3. **Optimization space for the prefix-sharing attention mechanism**: Existing work supports prefix-sharing attention by computing attention over different KV blocks, but the computations on different blocks run in separate kernels. The separate kernels increase the tail effect of the GPU kernels, and the launch overhead of multiple kernels is also high.

### Solutions

To address the above problems, BatchLLM proposes the following optimizations:

1. **Explicit prefix identification and sharing**: Before processing a large batch, BatchLLM explicitly identifies the common prefixes across the entire batch, so that prefix-sharing opportunities are not missed because of implicit caching. In addition, BatchLLM uses a dynamic programming algorithm to restructure the prefix tree, compressing multi-level prefixes into single-level prefixes to avoid the system complexity and kernel overhead that multi-level prefixes introduce.

2. **Group scheduling, request re-ordering, and memory-centric token batching**: BatchLLM schedules a group of requests sharing the same prefix as a unit, which simplifies prefix sharing and shortens the lifetime of the prefix KV memory. BatchLLM re-orders requests according to the prompt length and the estimated decoding length, giving priority to requests with a larger ratio of decoding length to prompt length. In this way, longer prompts are scheduled later, and their prefill chunks can be better mixed with the decoding tokens of requests scheduled earlier. At the same time, BatchLLM forms token-batches based on KV memory usage, which allows more tokens to be batched, enlarges the token-batch, and reduces the "valley" phenomenon shown in Figure 2.

3. **Horizontally fused prefix-sharing attention kernel**: BatchLLM horizontally fuses the computations on different KV blocks into a single kernel to reduce the tail effect and the kernel launch overhead. Although this method is not novel, it is very effective.

With these optimizations, BatchLLM outperforms vLLM by 1.1x to 2.0x on microbenchmarks and two typical industry workloads.
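
The group scheduling and request re-ordering ideas can be illustrated with a minimal Python sketch. This is not BatchLLM's implementation: the `Request` class, the fixed `prefix_len` grouping key, and the group-level ratio are all assumptions made for illustration. In particular, the paper identifies prefixes with a prefix tree and a dynamic programming step, which this sketch replaces with a simple fixed-length key.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    req_id: int
    prompt: Tuple[int, ...]   # prompt token ids
    est_decode_len: int       # decoding length, assumed to be estimated up front


def group_by_prefix(requests: List[Request], prefix_len: int) -> List[List[Request]]:
    """Group requests whose first `prefix_len` prompt tokens are identical.

    Simplification: a fixed-length key stands in for real prefix detection.
    """
    groups: Dict[Tuple[int, ...], List[Request]] = defaultdict(list)
    for req in requests:
        groups[req.prompt[:prefix_len]].append(req)
    return list(groups.values())


def decode_ratio(group: List[Request], prefix_len: int) -> float:
    """Estimated decode tokens divided by prefill tokens for one group.

    The shared prefix is counted once, since its KV context is reused.
    """
    shared = len(group[0].prompt[:prefix_len])
    prefill = shared + sum(len(r.prompt) - shared for r in group)
    decode = sum(r.est_decode_len for r in group)
    return decode / max(prefill, 1)


def schedule(requests: List[Request], prefix_len: int) -> List[List[Request]]:
    """Return prefix groups ordered so decode-heavy groups run first."""
    groups = group_by_prefix(requests, prefix_len)
    groups.sort(key=lambda g: decode_ratio(g, prefix_len), reverse=True)
    return groups


if __name__ == "__main__":
    reqs = [
        Request(0, (1, 2, 3, 4), est_decode_len=2),
        Request(1, (1, 2, 3, 5), est_decode_len=2),
        Request(2, (9, 9, 9, 7), est_decode_len=20),
        Request(3, (9, 9, 9, 8, 6), est_decode_len=10),
    ]
    for group in schedule(reqs, prefix_len=3):
        print([r.req_id for r in group], round(decode_ratio(group, 3), 2))
```

In this toy input, the group sharing the prefix `(9, 9, 9)` has the larger decode-to-prefill ratio and is placed first, so its decoding tokens would be available to mix with the prefill chunks of the group scheduled after it. Requests in the same group stay adjacent, which is what keeps the lifetime of the shared KV context short.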

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Multi-Bin Batching for Increasing LLM Inference Throughput

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

Fast Inference for Augmented Large Language Models

Efficient LLM inference solution on Intel GPU

Efficient LLM Scheduling by Learning to Rank

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Efficient Memory Management for Large Language Model Serving with PagedAttention

BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching

SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Fast Distributed Inference Serving for Large Language Models

Fairness in Serving Large Language Models

A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

FlashDecoding++: Faster Large Language Model Inference on GPUs

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs