BASS: Batched Attention-optimized Speculative Sampling

Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras
2024-06-27
Abstract: Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest of that of regular decoding and around 10X of single-sequence speculative decoding.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is how to reduce inference latency and raise throughput when a large language model generates multiple sequences. Specifically, it studies how to perform speculative decoding in a batched setting while preserving its latency benefit. This matters in practice because many generative AI applications require multiple responses rather than a single sequence.

### Main Problems

1. **Latency and Throughput**:
   - Most existing speculative decoding implementations focus on generating a single sequence, whereas practical applications often need multiple sequences.
   - Performing speculative decoding in a batched setting while keeping its latency advantage is a non-trivial challenge.
2. **GPU Utilization**:
   - Large language model inference is usually bound by memory bandwidth, which leaves GPU compute resources underutilized.
   - Batching multiple sequences amortizes the memory I/O cost and so increases GPU utilization, but a larger batch size brings higher latency and a larger memory footprint.
3. **Limitations of Speculative Decoding**:
   - Although speculative decoding can improve GPU utilization, it usually handles only a single sequence, which limits parallelism.
   - When the batch size is greater than 1, the benefit of speculative decoding drops significantly: once a token is rejected in one sequence, the latency advantage is lost for the entire batch.

### Solutions

The paper proposes **Batched Attention-optimized Speculative Sampling (BASS)**, a parallel speculative decoding method that processes multiple sequences simultaneously in a batched setting. BASS addresses the problems above in the following ways:

1. **Parallelism**:
   - BASS parallelizes along both the batch dimension and the draft-token dimension, which increases GPU utilization.
   - Custom CUDA kernels process the irregular tensors that arise in attention calculations, so BASS can handle sequences of different lengths efficiently.
2. **Dynamically Adjusting Draft Length**:
   - BASS uses a heuristic to adjust the draft length at each step, adapting to the degree of alignment across different prompts.
   - The heuristic moves toward the optimal draft length as generation proceeds, producing longer drafts when they are likely to be accepted without wasting compute on tokens that will be discarded.
3. **Performance Optimization**:
   - The paper verifies the performance gains of BASS experimentally on multiple models, including the CodeGen and OPT models.
   - The results show significant speed-ups in both the latency of the first finished sequence and the average latency over all sequences, while generation quality is maintained.

### Experimental Results

- **Summarization Task (XSum)**:
   - BASS speeds up generation of the first sequence by up to 2.81X and the average latency over all sequences by up to 2.34X.
   - When multiple sequences are generated, the user-perceived latency is significantly reduced.
- **Code Generation Task (HumanEval)**:
   - BASS speeds up generation of the first sequence by up to 2.65X and the average latency over all sequences by up to 2.43X.
   - The accuracy of batch-generated code rises with batch size, because the probability that at least one generation in the batch is correct increases.

### Conclusion

By implementing efficient speculative decoding in a batched setting, BASS significantly improves the latency and throughput of multi-sequence generation while maintaining generation quality.
This method is of great significance in practical applications, especially in scenarios where multiple responses need to be generated quickly.
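
The two ideas in the Solutions list (sequences that accept different numbers of draft tokens, and a draft length that adapts from step to step) can be illustrated with a small sketch. This is not the paper's implementation: it swaps in greedy exact-match acceptance for the probabilistic accept/reject rule of speculative sampling, uses plain Python lists where BASS uses custom CUDA kernels, and the function names, the growth step of 2, and the bounds `lo` and `hi` are all hypothetical choices made for illustration.

```python
from typing import List


def accept_drafts(draft: List[List[int]], target: List[List[int]]) -> List[int]:
    """Count, per sequence, how many leading draft tokens the target agrees with.

    Assumption: greedy exact-match acceptance, a simplification of the
    probabilistic rule used in speculative sampling.
    """
    accepted = []
    for d, t in zip(draft, target):
        n = 0
        for d_tok, t_tok in zip(d, t):
            if d_tok != t_tok:
                break
            n += 1
        accepted.append(n)
    return accepted


def extend_ragged(sequences: List[List[int]], draft: List[List[int]],
                  target: List[List[int]], accepted: List[int]) -> List[List[int]]:
    """Append the accepted draft tokens plus one target token to each sequence.

    Each sequence advances by a different amount, so the batch becomes ragged
    instead of being held back by the sequence that rejected earliest.
    """
    out = []
    for seq, d, t, n in zip(sequences, draft, target, accepted):
        # t[n] is the target model's own token at the first unverified position.
        out.append(seq + d[:n] + [t[n]])
    return out


def next_draft_length(current: int, accepted: List[int],
                      lo: int = 1, hi: int = 16) -> int:
    """Hypothetical heuristic (not the paper's): grow the draft when some
    sequence accepted every draft token, otherwise shrink it by one."""
    if max(accepted) == current:
        return min(current + 2, hi)
    return max(lo, current - 1)


# Toy batch of three sequences with a draft length of 3. Each target row has
# one extra token: the target model's prediction after the last draft token.
sequences = [[1, 2], [7], [4, 5, 6]]
draft = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
target = [[10, 11, 12, 13], [20, 99, 98, 97], [55, 54, 53, 52]]

accepted = accept_drafts(draft, target)
sequences = extend_ragged(sequences, draft, target, accepted)
new_length = next_draft_length(3, accepted)
```

In this toy batch the three sequences accept 3, 1, and 0 draft tokens, so they grow by 4, 2, and 1 tokens respectively. Because one sequence accepted the whole draft, the hypothetical heuristic raises the next draft length from 3 to 5.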