Accelerating LLM Inference with Staged Speculative Decoding

Benjamin Spector, Chris Re

2023-08-09

Abstract: Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality.

Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper targets the low arithmetic intensity of large language model (LLM) inference in small-batch, on-device settings, and proposes a new algorithm, "Staged Speculative Decoding", to accelerate it. Specifically, the paper addresses the following key issues:

1. **Improving Small-Batch Inference Efficiency**: Small-batch inference leaves computational resources underutilized, so decoding is slow. By improving speculative decoding techniques, the paper aims to increase inference speed, especially in applications that require low-latency responses.
2. **Enhancing Personalized Experience**: Faster local inference allows LLMs to be customized to individual user needs, providing a more personalized user experience.
3. **Protecting Data Privacy**: Local inference reduces the need to transmit data to the cloud, which improves data security and privacy.

To achieve these goals, the paper proposes two main improvements:

1. **Tree-Structured Batching**: Reorganizing the speculative batch into a tree of possible sequences, so that larger and higher-quality speculative batches can be created more quickly.
2. **Multi-Level Speculative Decoding**: Applying speculative decoding not only to the original model but also to the smaller "draft model" that generates its preliminary predictions, which further improves performance.

The authors validate these improvements experimentally with the 762-million-parameter GPT-2-L model, reducing single-batch decoding latency by a factor of 3.16 without sacrificing output quality.
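The multi-level idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes greedy decoding, uses hypothetical toy "models" (`target`, `draft`, `tiny`) that are plain functions from a token context to the next token, and omits tree-structured batching entirely. It shows only that a draft model can itself be accelerated by a still smaller model, and that the final output matches ordinary decoding with the target model.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model: maps a token context to its greedy next token.
Model = Callable[[Sequence[int]], int]
# A drafter proposes k continuation tokens for a given context.
Drafter = Callable[[List[int], int], List[int]]

VOCAB = 50  # toy vocabulary size (assumption)

def target(ctx: Sequence[int]) -> int:
    """Toy 'large' model."""
    return (3 * ctx[-1] + len(ctx)) % VOCAB

def draft(ctx: Sequence[int]) -> int:
    """Toy draft model: disagrees with the target at every 5th position."""
    t = target(ctx)
    return t if len(ctx) % 5 else (t + 1) % VOCAB

def tiny(ctx: Sequence[int]) -> int:
    """Toy second-stage model: disagrees with the draft at every 3rd position."""
    d = draft(ctx)
    return d if len(ctx) % 3 else (d + 1) % VOCAB

def autoregressive(model: Model, prompt: Sequence[int], n: int) -> List[int]:
    """Ordinary greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out

def speculative(model: Model, drafter: Drafter, prompt: Sequence[int],
                n: int, k: int) -> List[int]:
    """Greedy speculative decoding: accept drafted tokens until the first mismatch."""
    out = list(prompt)
    goal = len(out) + n
    while len(out) < goal:
        proposal = drafter(out, k)
        # A real system would score all proposed positions in one batched
        # forward pass; this sketch calls the model once per position.
        for tok in proposal:
            expected = model(out)
            out.append(expected)      # always keep the verifying model's token
            if expected != tok:
                break                 # discard the rest of the proposal
        else:
            out.append(model(out))    # every draft accepted: one extra token
    return out[:goal]

def plain_drafter(model: Model) -> Drafter:
    """Single-stage drafter: the small model decodes autoregressively."""
    return lambda ctx, k: autoregressive(model, ctx, k)[len(ctx):]

def staged_drafter(draft_model: Model, tiny_model: Model, k_inner: int) -> Drafter:
    """Second stage: the draft model is itself decoded speculatively."""
    inner = plain_drafter(tiny_model)
    return lambda ctx, k: speculative(draft_model, inner, ctx, k, k_inner)[len(ctx):]

prompt = [1, 2, 3]
reference = autoregressive(target, prompt, 20)
staged = speculative(target, staged_drafter(draft, tiny, 2), prompt, 20, 4)
print(staged == reference)
```

Because every emitted token is the one the verifying model would have produced itself, the staged output is identical to plain greedy decoding; any speedup in a real system would come from verifying several drafted positions per forward pass of the large model.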
