Abstract: Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest of that of regular decoding and around 10X of single-sequence speculative decoding.

What problem does this paper attempt to address?

This paper aims to improve the inference latency and throughput of large language models when generating multiple sequences. Specifically, it addresses how to perform speculative decoding in a batched setting while preserving its latency benefits. This matters in practice because many generative AI applications require multiple responses rather than a single sequence.

### Main Problems

1. **Latency and Throughput**:
   - Most existing speculative decoding implementations generate a single sequence, while practical applications often need multiple sequences.
   - Performing speculative decoding in a batched setting while preserving its latency benefits is a non-trivial challenge.
2. **GPU Utilization**:
   - Large language model inference is usually bound by memory bandwidth, which leaves GPU compute resources underutilized.
   - Batching multiple sequences amortizes the memory I/O cost and raises GPU utilization, but a larger batch size increases latency and memory usage.
3. **Limitations of Speculative Decoding**:
   - Although speculative decoding can improve GPU utilization, existing implementations usually handle only a single sequence, which limits parallelism.
   - When the batch size is greater than 1, the benefit of speculative decoding drops significantly: sequences in a batch accept different numbers of draft tokens, and once a token in one sequence is rejected, the latency advantage of the entire batch is lost.

### Solutions

The paper proposes **Batched Attention-optimized Speculative Sampling (BASS)**, a parallel speculative decoding method that processes multiple sequences simultaneously in a batched setting. BASS addresses the problems above in the following ways:

1. **Parallelism**:
   - BASS parallelizes across both the batch dimension and the draft-token dimension, which increases GPU utilization.
   - Custom CUDA kernels handle ragged (irregular) tensors in the attention calculation, so BASS can efficiently process sequences of different lengths.
2. **Dynamically Adjusting Draft Length**:
   - BASS uses a heuristic to adjust the draft length at each step, adapting to how well the draft model aligns with the main model, which varies across prompts.
   - The heuristic moves toward the optimal draft length as decoding proceeds, generating longer drafts when they are likely to be accepted without wasting compute on tokens that will be rejected.
3. **Performance Optimization**:
   - The paper verifies the performance gains of BASS experimentally on multiple models, including the CodeGen and OPT models.
   - The results show that BASS significantly reduces both the latency of generating the first sequence and the average latency over all sequences, while maintaining generation quality.

### Experimental Results

- **Summarization Task (XSum)**:
  - BASS reduces the latency of generating the first sequence by a factor of up to 2.81, and the average latency over all sequences by a factor of up to 2.34.
  - When multiple sequences are generated, user-perceived latency is significantly reduced.
- **Code Generation Task (HumanEval)**:
  - BASS reduces the latency of generating the first sequence by a factor of up to 2.65, and the average latency over all sequences by a factor of up to 2.43.
  - The accuracy of batch-generated code increases with batch size, because the probability that at least one generation in the batch is correct increases.

### Conclusion

BASS significantly improves the latency and throughput of multi-sequence generation while maintaining generation quality, by implementing efficient speculative decoding in a batched setting. The method is particularly relevant to applications where multiple responses need to be generated quickly.
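To make the batching difficulty concrete, the following is a minimal sketch of the standard speculative-sampling accept/reject rule applied independently to each sequence in a batch. It is not the BASS implementation: the function names `verify_draft` and `batched_step`, and the use of plain dictionaries for token distributions, are assumptions made for illustration. The bonus token that is normally sampled when every draft token is accepted is omitted for brevity.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Verify one sequence's draft tokens (illustrative helper, not from the paper).

    draft_probs[i] and target_probs[i] map token -> probability at position i
    under the draft model and the main model respectively.
    Returns the accepted tokens, plus one resampled token if a draft token
    is rejected. (The extra token after a fully accepted draft is omitted.)
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i][tok]            # draft model probability of the proposal
        p = target_probs[i].get(tok, 0.0)  # main model probability of the same token
        if rng.random() < min(1.0, p / q):
            out.append(tok)                # accepted: keep the draft token
            continue
        # Rejected: resample from the normalized residual max(p - q, 0).
        residual = {t: max(pt - draft_probs[i].get(t, 0.0), 0.0)
                    for t, pt in target_probs[i].items()}
        total = sum(residual.values())
        tokens = list(residual)
        weights = [residual[t] / total for t in tokens]
        out.append(rng.choices(tokens, weights=weights)[0])
        break                              # everything after a rejection is discarded
    return out

def batched_step(batch, rng):
    """Apply verification to every sequence; the results have different lengths."""
    return [verify_draft(d, q, p, rng) for d, q, p in batch]

rng = random.Random(0)
agree = [{1: 0.5, 2: 0.5}] * 3                     # draft and main model agree
batch = [
    ([1, 2, 1], agree, agree),                     # every draft token is accepted
    ([1, 1], [{1: 1.0}] * 2, [{2: 1.0}] * 2),      # first draft token is rejected
]
print([len(tokens) for tokens in batched_step(batch, rng)])
```

Because each sequence advances by a different number of tokens per step, the batch becomes ragged, which is the situation the summary says BASS handles with custom attention kernels.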
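The summary does not spell out the draft-length heuristic, so the rule below is a hypothetical illustration of the general idea rather than the paper's algorithm. It assumes the draft should grow when at least one sequence accepted every drafted token and shrink otherwise; the function name `next_draft_length` and all parameter defaults are invented for this sketch.

```python
def next_draft_length(current, accepted_per_seq,
                      min_len=1, max_len=32, grow=2, shrink=1):
    """Hypothetical draft-length update (all names and defaults are assumptions).

    current: draft length used in the step that just finished
    accepted_per_seq: number of draft tokens accepted for each sequence
    """
    best = max(accepted_per_seq)
    if best >= current:
        # Some sequence accepted the whole draft: a longer draft may pay off.
        proposed = current + grow
    else:
        # No sequence used the full draft: back off, but not below the best
        # accepted length observed in this step.
        proposed = max(best, current - shrink)
    return max(min_len, min(max_len, proposed))

print(next_draft_length(4, [4, 1, 2]))  # grows: one sequence accepted all 4
print(next_draft_length(4, [1, 0, 2]))  # shrinks: no sequence accepted all 4
```

Under this assumed rule, the draft length tracks the best-aligned sequence in the batch, which trades some wasted draft tokens on poorly aligned sequences for longer accepted runs on well-aligned ones.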
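The claim that accuracy rises with batch size can be illustrated with a simple probability model. The sketch below assumes each generation is correct independently with the same probability, which is a simplification and not something the paper states; the function name `prob_at_least_one_correct` is chosen for this example.

```python
def prob_at_least_one_correct(p_single, batch_size):
    """Probability that at least one of `batch_size` independent samples is
    correct, assuming each is correct with probability `p_single`."""
    return 1.0 - (1.0 - p_single) ** batch_size

for n in (1, 2, 4, 8):
    print(n, round(prob_at_least_one_correct(0.3, n), 3))
```

Under the independence assumption the probability grows monotonically with batch size, which matches the direction of the trend reported for HumanEval; real samples from the same prompt are correlated, so actual gains would differ from this model.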

BASS: Batched Attention-optimized Speculative Sampling
