Accelerating LLM Inference with Staged Speculative Decoding

Benjamin Spector, Chris Re
2023-08-09
Abstract: Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper primarily addresses the low arithmetic intensity of large language model (LLM) inference in small-batch, on-device settings, and proposes a new algorithm called "Staged Speculative Decoding" to accelerate it. Specifically, the paper is concerned with the following:

1. **Improving Small-Batch Inference Efficiency**: Small-batch inference makes poor use of the available compute, so decoding is slow and inefficient. By improving speculative decoding, the paper aims to increase inference speed, especially in applications that require low-latency responses.
2. **Enhancing Personalized Experience**: Faster local inference makes it practical to customize an LLM to an individual user's needs, providing a more personalized experience.
3. **Protecting Data Privacy**: Local inference reduces the need to transmit data to the cloud, which improves data security and privacy.

To achieve these goals, the paper proposes two main improvements:

1. **Tree-Structured Batching**: Reorganizing the speculative batch into a tree of possible sequences, so that larger and higher-quality speculative batches can be built more quickly.
2. **Multi-Level Speculative Decoding**: Applying speculative decoding not only to the original model but also to the smaller model that generates the preliminary predictions (the "draft model"), to further improve performance.

The authors validate these improvements experimentally on the 762-million-parameter GPT-2-L model, achieving a 3.16-fold reduction in single-batch decoding latency without sacrificing output quality.
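
The idea of verifying drafted tokens, and of adding a second draft stage, can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the three "models" are hypothetical toy functions (`make_toy_model`) rather than neural networks, verification uses greedy exact-match acceptance, and the speculative batch is a plain sequence rather than the tree described in the paper. What it shows is that the staged output is identical to ordinary greedy decoding with the target model, while the target is consulted in fewer verification passes.

```python
from typing import Callable, Dict, List, Optional

Model = Callable[[List[int]], int]               # greedy next-token function
Drafter = Callable[[List[int], int], List[int]]  # proposes n tokens after a context

VOCAB = 50  # hypothetical toy vocabulary size


def make_toy_model(noise_every: int) -> Model:
    """Hypothetical deterministic stand-in for a language model.

    noise_every == 0 gives the reference behaviour; a non-zero value makes the
    model disagree with the reference whenever len(ctx) is a multiple of it.
    """
    def next_token(ctx: List[int]) -> int:
        base = (sum(ctx) * 31 + len(ctx) * 7) % VOCAB
        if noise_every and len(ctx) % noise_every == 0:
            return (base + 1) % VOCAB
        return base
    return next_token


def greedy_decode(model: Model, prefix: List[int], n: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prefix)
    for _ in range(n):
        out.append(model(out))
    return out[len(prefix):]


def speculative_decode(target: Model, drafter: Drafter, prefix: List[int],
                       n: int, k: int,
                       stats: Optional[Dict[str, int]] = None) -> List[int]:
    """Generate n tokens, verifying up to k drafted tokens per target pass."""
    out = list(prefix)
    goal = len(prefix) + n
    while len(out) < goal:
        draft = drafter(out, min(k, goal - len(out)))
        # A real system would score every draft position in ONE batched
        # forward pass of the target; this loop only emulates that pass.
        if stats is not None:
            stats["target_passes"] += 1
        for tok in draft:
            expected = target(out)
            out.append(expected)   # always emit the target's own token
            if expected != tok:    # first mismatch: discard the rest of the draft
                break
    return out[len(prefix):]


target = make_toy_model(0)  # stands in for the large model
mid = make_toy_model(7)     # stands in for the draft model
small = make_toy_model(3)   # stands in for the draft model's own drafter


def small_drafter(ctx: List[int], n: int) -> List[int]:
    return greedy_decode(small, ctx, n)


def mid_drafter(ctx: List[int], n: int) -> List[int]:
    # Second stage: the draft model is itself decoded speculatively.
    return speculative_decode(mid, small_drafter, ctx, n, k=3)


prompt = [1, 2, 3]
reference = greedy_decode(target, prompt, 40)
stats = {"target_passes": 0}
staged = speculative_decode(target, mid_drafter, prompt, 40, k=5, stats=stats)

assert staged == reference  # output matches plain greedy decoding exactly
print("target verification passes:", stats["target_passes"], "for", len(staged), "tokens")
```

Because every emitted token is the target's own prediction, a poor draft can only cost speed, never change the output. In this sketch the number of verification passes stands in for latency; the real speedup depends on how cheaply the hardware can score a whole batch of draft positions at once, which the toy functions do not model.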