Abstract: Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens and then uses the target LLM to verify those draft tokens. The speedup provided by speculative decoding heavily depends on the choice of the draft model. In this work, we perform a detailed study comprising over 350 experiments with LLaMA-65B and OPT-66B using speculative decoding and delineate the factors that affect the performance gain provided by speculative decoding. Our experiments indicate that the performance of speculative decoding depends heavily on the latency of the draft model, and the draft model's capability in language modeling does not correlate strongly with its performance in speculative decoding. Based on these insights we explore a new design space for draft models and design hardware-efficient draft models for speculative decoding. Our newly designed draft model for LLaMA-65B can provide 111% higher throughput than existing draft models and can generalize further to the LLaMA-2 model family and supervised fine-tuned models.
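The draft-then-verify loop described in the abstract can be sketched with toy models. This is a minimal greedy-decoding illustration, not the paper's implementation: `toy_target`, `toy_draft`, `speculative_decode_greedy`, and the `lookahead` parameter are hypothetical names, and the "models" are plain functions from a token prefix to the next token. A real system would score all draft positions in one batched forward pass of the target; here the target is called once per position for clarity.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (7 * prefix[-1] + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except at every 4th position."""
    if len(prefix) % 4 == 0:
        return 0
    return toy_target(prefix)


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode_greedy(target: NextToken, draft: NextToken,
                              prompt: List[int], max_new_tokens: int,
                              lookahead: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # Step 1: the draft model proposes `lookahead` tokens autoregressively.
        proposal: List[int] = []
        for _ in range(lookahead):
            proposal.append(draft(tokens + proposal))

        # Step 2: the target checks each proposed token in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the target contributes one extra token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]
```

Because every emitted token is one the target itself would have chosen, the output matches plain greedy decoding with the target, which is the sense in which quality is not sacrificed. The speedup comes from how many draft tokens are accepted per target pass and how cheaply they were produced.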

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "Decoding Speculative Decoding" explores how to optimize the draft model to improve the inference throughput of large language models (LLMs) under speculative decoding. Specifically, it addresses the following key problems:

1. **Selection and design of the draft model**:
   - **Limitations of existing draft models**: Existing draft models are usually designed to maximize accuracy under a given parameter budget, but these models do not perform well in speculative decoding. Through experiments, the paper finds that the latency of the draft model is the main bottleneck of speculative decoding performance, and that the draft model's accuracy on language modeling tasks does not correlate strongly with its performance in speculative decoding.
   - **New design space**: Based on these findings, the paper explores a new design space and proposes a more efficient way to design draft models. The newly designed draft models significantly improve inference throughput by increasing width and reducing depth while keeping the number of parameters the same.
2. **Analysis of performance bottlenecks**:
   - **Identification of performance bottlenecks**: Through detailed benchmarks and performance profiling, the paper identifies the main performance bottlenecks in speculative decoding, in particular the latency of the draft model.
   - **Improvement of hardware efficiency**: By optimizing the design of the draft model, the paper shows how to significantly improve the hardware efficiency and throughput of speculative decoding without sacrificing accuracy.
3. **Experimental verification**:
   - **Extensive experiments**: The paper conducts more than 350 experiments, using multiple large language models (such as LLaMA-65B and OPT-66B) and draft models of different sizes, to verify the effectiveness of the newly designed draft models.
   - **Generalization ability across models**: The paper also verifies that the newly designed draft models generalize to other models (such as the LLaMA-2 family and supervised fine-tuned models), demonstrating their wide applicability.

### Main contributions

1. **Comprehensive experimental research**:
   - The paper is the first comprehensive experimental study of speculative decoding on the open-source LLaMA-65B and OPT-66B models. It conducts more than 350 experiments, revealing the key factors to consider when selecting and designing draft models.
2. **Systematic redesign of the draft model**:
   - The paper shows that selecting draft models by their accuracy on language modeling tasks may lead to sub-optimal choices, and verifies experimentally that redesigning the draft model can increase the throughput of speculative decoding by up to 111% (sampling decoding) and 60% (greedy decoding).
3. **Impact of model and hardware improvements**:
   - The paper also studies how further improvements in models and hardware affect the design of draft models for future generations of LLMs.

### Conclusion

Through in-depth experiments and performance analysis, the paper reveals the crucial role of the draft model in speculative decoding and proposes a new design method that significantly improves the inference throughput of large language models. These findings matter for existing LLM inference optimization and also point to directions for future research.
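The finding that draft latency can matter more than draft accuracy can be illustrated with a simplified throughput model. This is a sketch under stated assumptions, not the paper's analysis: it assumes each draft token is accepted independently with probability `alpha`, which gives the commonly used expected-tokens-per-step formula `(1 - alpha**(k + 1)) / (1 - alpha)` for a lookahead of `k`. The function names and every latency and acceptance-rate value below are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify step.

    Assumption: each of the k draft tokens is accepted independently
    with probability alpha; the target always contributes one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def throughput(alpha: float, k: int, t_draft: float, t_target: float) -> float:
    """Tokens per second for one step: k draft passes plus one target pass."""
    step_time = k * t_draft + t_target
    return expected_tokens_per_step(alpha, k) / step_time


# Hypothetical numbers, chosen only to illustrate the trade-off.
T_TARGET = 0.100   # seconds per target verification pass
LOOKAHEAD = 4      # draft tokens proposed per step

# A deeper draft: slightly higher acceptance rate, but slower per token.
deep_draft = throughput(alpha=0.80, k=LOOKAHEAD, t_draft=0.020, t_target=T_TARGET)

# A wider, shallower draft: slightly lower acceptance rate, much faster per token.
wide_draft = throughput(alpha=0.78, k=LOOKAHEAD, t_draft=0.008, t_target=T_TARGET)

if __name__ == "__main__":
    print(f"deep draft: {deep_draft:.2f} tokens/s")
    print(f"wide draft: {wide_draft:.2f} tokens/s")
```

Under these made-up numbers the faster draft yields higher throughput despite its lower acceptance rate, because draft latency is paid `k` times per step while the gain from a small accuracy improvement is modest. Real acceptance rates and latencies would have to be measured on the target hardware.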
