Abstract: The advanced capabilities of Large Language Models (LLMs) have inspired the development of various interactive web services and applications, such as ChatGPT, which offer query inference services to users. Unlike traditional DNN models, LLM inference entails a different number of forward-computation iterations for each query, which creates efficiency challenges for existing run-to-completion batch-wise inference. Hence, some methods refine batch-wise inference to the iteration level by duplicating all nonlinear layers of the LLM. However, this approach not only increases resource usage but also introduces idle computations to the batch due to the prefilling of newly added queries. Therefore, we propose BATON, an efficient batch-wise LLM inference scheme that dynamically adjusts the processing batch and can achieve near-zero idle computations without incurring additional resource consumption. To do so, BATON 1) shapes the vectors involved in the inference of the newly inserted query and the processing batch to align their dimensions, and generates a new attention mask based on the vector shaping to ensure inference correctness, which enables query insertion without consuming additional resources; 2) embeds the prefilled Keys and Values of the new query into the KV_Cache of the processing batch by leveraging a prefilling and decoding separation mechanism, eliminating the idle computations that the prefilling of the new query would otherwise introduce to the batch. Experimental results show that, compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
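To make the idle-computation problem of run-to-completion batching concrete, the following minimal sketch counts how many per-query iteration slots are wasted when a batch runs until its longest query finishes. The function name `idle_fraction` and the output lengths are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def idle_fraction(decode_steps):
    """Fraction of iteration slots wasted under run-to-completion batching.

    decode_steps[i] is the number of decoding iterations query i needs.
    The batch keeps running until the longest query finishes, so every
    query occupies max(decode_steps) slots, whether it needs them or not.
    """
    longest = max(decode_steps)
    total_slots = longest * len(decode_steps)
    useful_slots = sum(decode_steps)
    return (total_slots - useful_slots) / total_slots


# Hypothetical output lengths (in tokens) for a batch of four queries.
steps = [10, 40, 25, 100]
print(idle_fraction(steps))  # 0.5625
```

In this made-up batch, more than half of the slots are spent on queries that have already finished, which is the waste that iteration-level schemes aim to remove.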

What problem does this paper attempt to address?

This paper addresses the efficiency challenges of batch-wise inference for large language models (LLMs). Existing batch inference methods introduce a large amount of idle computation when the queries in a batch have different lengths: when one query finishes while others are still running, the finished query continues to occupy GPU resources for unnecessary computation. In addition, because vector dimensions must be aligned, a traditional batch inference framework cannot directly add a new query to the batch that is currently being processed, which further limits resource utilization. To address these challenges, the paper proposes BATON, an efficient batch inference scheme that uses dynamic re-batching to achieve nearly zero idle computation without additional resource consumption. The main contributions of BATON are as follows:

1. **Vector Reshaping and Embedding Strategy**: Through padding operations, the vectors of the newly inserted query and of the queries currently being processed are aligned, and a new attention mask is generated to ensure the correctness of subsequent inference iterations. This enables new queries to be inserted into the current batch without consuming additional resources.
2. **Prefilling and Decoding Separation Mechanism**: By decoupling the prefilling and decoding stages, BATON can embed the Keys and Values of new queries into the KV cache of the current batch without introducing additional padding (for dimension alignment), thereby avoiding the idle computations that the prefilling of new queries would otherwise introduce.

Experimental results show that, compared with the current state-of-the-art solution Orca, BATON improves query-processing throughput by 1.29 to 1.75 times. This improvement is significant for enhancing the performance of LLM-based interactive web services and applications.
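The padding and attention-mask idea behind the first contribution can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not BATON's actual implementation: the helper names `left_pad` and `insert_query` are hypothetical, cached Key/Value entries are represented by placeholder strings, and left-padding is assumed as the alignment convention.

```python
def left_pad(rows, target_len, pad_value):
    """Left-pad every row with pad_value until it has target_len entries."""
    return [[pad_value] * (target_len - len(row)) + list(row) for row in rows]


def insert_query(batch_kv, batch_mask, new_kv, pad_entry=None):
    """Add one prefilled query to a running batch (hypothetical helper).

    batch_kv:   B rows of cached entries, already aligned to a common length
    batch_mask: B rows of 0/1 flags (1 = real token, 0 = padding)
    new_kv:     cached entries of the new query, one per prompt token
    """
    current_len = len(batch_mask[0])
    target_len = max(current_len, len(new_kv))
    # Align the running batch and the new query to the same length.
    kv = left_pad(batch_kv + [new_kv], target_len, pad_entry)
    # Rebuild the attention mask so padded positions are never attended to.
    mask = left_pad(batch_mask + [[1] * len(new_kv)], target_len, 0)
    return kv, mask


# Two running queries (the second already carries one padding slot)
# and a shorter new query with two prompt tokens.
batch_kv = [["a1", "a2", "a3", "a4"], [None, "b1", "b2", "b3"]]
batch_mask = [[1, 1, 1, 1], [0, 1, 1, 1]]
kv, mask = insert_query(batch_kv, batch_mask, ["c1", "c2"])
print(mask)  # [[1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]]
```

When the new query is longer than the running batch, the same code pads the existing rows instead. The sketch copies an already-prefilled cache into the batch, so the running queries do not wait on the new query's prefill, which is the intuition behind the second contribution.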

BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching

Multi-Bin Batching for Increasing LLM Inference Throughput

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Batch Prompting: Efficient Inference with Large Language Model APIs

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

InferCept: Efficient Intercept Support for Augmented Large Language Model Inference

Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

Fast Distributed Inference Serving for Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Efficient LLM Inference with Kcache

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

LLMCad: Fast and Scalable On-device Large Language Model Inference

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And Faithfulness

Efficient Memory Management for Large Language Model Serving with PagedAttention

D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models

SparQ Attention: Bandwidth-Efficient LLM Inference