BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching

Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang
2024-10-24
Abstract: The advanced capabilities of Large Language Models (LLMs) have inspired the development of various interactive web services and applications, such as ChatGPT, which offer query inference services to users. Unlike traditional DNN models, LLM inference requires a different number of forward-computation iterations for different queries, which creates efficiency challenges for existing run-to-completion batch-wise inference. Hence, some methods refine batch-wise inference to the iteration level by duplicating all nonlinear layers of the LLM. However, this approach not only increases resource usage but also introduces idle computations to the batch due to the prefilling of newly added queries. Therefore, we propose BATON, an efficient batch-wise LLM inference scheme that dynamically adjusts the processing batch and can achieve near-zero idle computations without incurring additional resource consumption. To do so, BATON 1) shapes the vectors involved in the inference of the newly inserted query and the processing batch to align their dimensions, and generates a new attention mask based on the vector shaping to ensure inference correctness, which enables query insertion without consuming additional resources; 2) embeds the prefilled Keys and Values of the new query into the KV_Cache of the processing batch by leveraging the prefilling and decoding separation mechanism, eliminating the idle computations that the prefilling of the new query would otherwise introduce to the batch. Experimental results show that, compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
Machine Learning
What problem does this paper attempt to address?
The paper addresses efficiency problems in batch-wise inference for large language models (LLMs). Existing batch inference methods introduce a large amount of idle computation when handling queries of different lengths: when one query finishes while others in the batch are still running, the finished query continues to occupy GPU resources for unnecessary computation. In addition, because vector dimensions must be aligned, a traditional batch inference framework cannot add a new query directly to the batch that is currently being processed, which further limits effective resource utilization.

To address these challenges, the paper proposes BATON, an efficient batch inference scheme. Through dynamic re-batching, it achieves near-zero idle computation without additional resource consumption. The main contributions of BATON are as follows:

1. **Vector Reshaping and Embedding Strategy**: Through padding operations, the vectors of the newly inserted query and of the queries currently being processed are aligned, and a new attention mask is generated to ensure the correctness of subsequent inference iterations. This allows new queries to be inserted into the current batch without consuming additional resources.
2. **Prefilling and Decoding Separation Mechanism**: By decoupling the prefilling and decoding stages, BATON can embed the Keys and Values of new queries into the KV cache of the current batch without introducing additional padding (for dimension alignment), thereby avoiding the idle computation that the prefilling of new queries would otherwise introduce.

Experimental results show that, compared with the state-of-the-art solution Orca, BATON improves query-processing throughput by 1.29 to 1.75 times. This improvement matters for the performance of LLM-based interactive web services and applications.
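The padding and attention-mask step described in the first contribution can be illustrated with a minimal sketch. This is not BATON's implementation: the function names `left_pad` and `insert_query`, the padding id `PAD_ID`, the use of plain Python lists instead of tensors, and the choice of left-padding are all assumptions made for illustration. The sketch only shows how a new query and a running batch might be brought to a common length while the mask marks padded positions so they are ignored by attention.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id, chosen only for this sketch


def left_pad(row: List[int], target_len: int, fill: int) -> List[int]:
    """Prepend `fill` until the row has `target_len` entries."""
    return [fill] * (target_len - len(row)) + list(row)


def insert_query(
    batch_tokens: List[List[int]],
    batch_mask: List[List[int]],
    new_tokens: List[int],
) -> Tuple[List[List[int]], List[List[int]]]:
    """Align a running batch and a new query, then build the new attention mask.

    Mask entries are 1 for real tokens and 0 for padding. Left-padding is an
    assumption of this sketch, not a detail confirmed by the source.
    """
    current_len = len(batch_tokens[0]) if batch_tokens else 0
    target_len = max(current_len, len(new_tokens))

    # Existing rows grow only if the new query is longer than the batch.
    tokens = [left_pad(row, target_len, PAD_ID) for row in batch_tokens]
    mask = [left_pad(row, target_len, 0) for row in batch_mask]

    # The new query is padded up to the batch length and marked in the mask.
    tokens.append(left_pad(new_tokens, target_len, PAD_ID))
    mask.append(left_pad([1] * len(new_tokens), target_len, 0))
    return tokens, mask


# A running batch of two queries (length 4) receives a new query of length 2.
tokens, mask = insert_query(
    [[5, 6, 7, 8], [PAD_ID, 9, 10, 11]],
    [[1, 1, 1, 1], [0, 1, 1, 1]],
    [21, 22],
)
print(mask)  # [[1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]]
```

Every zero in the mask is a position that is carried through the batch without contributing to any query's output, which is why the second contribution aims to insert prefilled Keys and Values without adding further padding.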