Decoding Speculative Decoding

Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman
2024-08-12
Abstract: Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens and then uses the target LLM to verify those draft tokens. The speedup provided by speculative decoding heavily depends on the choice of the draft model. In this work, we perform a detailed study comprising over 350 experiments with LLaMA-65B and OPT-66B using speculative decoding and delineate the factors that affect the performance gain provided by speculative decoding. Our experiments indicate that the performance of speculative decoding depends heavily on the latency of the draft model, and the draft model's capability in language modeling does not correlate strongly with its performance in speculative decoding. Based on these insights, we explore a new design space for draft models and design hardware-efficient draft models for speculative decoding. Our newly designed draft model for LLaMA-65B can provide 111% higher throughput than existing draft models and can generalize further to the LLaMA-2 model family and supervised fine-tuned models.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "Decoding Speculative Decoding" explores how to optimize the draft model to improve the inference throughput of large language models (LLMs) under speculative decoding. Specifically, it addresses the following problems:

1. **Selection and design of the draft model**:
   - **Limitations of existing draft models**: Existing draft models are usually designed to maximize accuracy under a given parameter budget, but such models do not perform well in speculative decoding. The experiments show that draft model latency is the main bottleneck for speculative decoding performance, and that a draft model's accuracy on language modeling tasks does not correlate strongly with its performance in speculative decoding.
   - **New design space**: Based on these findings, the paper explores a new design space and proposes more efficient draft models. The new draft models are wider and shallower than existing ones at the same parameter count, which significantly improves inference throughput.
2. **Analysis of performance bottlenecks**:
   - **Identification of performance bottlenecks**: Through detailed benchmarking and performance profiling, the paper identifies the main bottlenecks in speculative decoding, in particular the latency of the draft model.
   - **Improvement of hardware efficiency**: By redesigning the draft model, the paper shows how to significantly improve the hardware efficiency and throughput of speculative decoding without sacrificing accuracy.
3. **Experimental verification**:
   - **Extensive experiments**: The paper reports more than 350 experiments with large language models (LLaMA-65B and OPT-66B) and draft models of different sizes to verify the effectiveness of the newly designed draft models.
   - **Generalization across models**: The paper also verifies that the newly designed draft models generalize to other models (the LLaMA-2 family and supervised fine-tuned models), which demonstrates their broader applicability.

### Main contributions

1. **Comprehensive experimental research**:
   - The paper presents the first comprehensive experimental study of speculative decoding on the open-source LLaMA-65B and OPT-66B models. Its more than 350 experiments reveal the key factors to consider when selecting and designing draft models.
2. **Systematic redesign of the draft model**:
   - The paper shows that selecting a draft model by its accuracy on language modeling tasks can lead to suboptimal choices, and it verifies experimentally that redesigning the draft model can increase speculative decoding throughput by up to 111% (sampling decoding) and 60% (greedy decoding).
3. **Impact of model and hardware improvements**:
   - The paper also studies how further improvements in models and hardware affect the design of draft models for future generations of LLMs.

### Conclusion

Through in-depth experiments and performance analysis, the paper shows that the draft model plays a crucial role in speculative decoding and proposes a draft model design that significantly improves LLM inference throughput. These findings matter for current LLM inference optimization and point to directions for future research.
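
### Illustration: why draft latency can outweigh draft accuracy

The latency finding summarized above can be made concrete with a simplified throughput model. This is a sketch under stated assumptions, not the paper's own analysis: it assumes each draft token is accepted independently with a fixed probability `alpha`, that the draft model proposes `k` tokens per step, and that the target model verifies them in a single forward pass. The function names and all numeric values below are hypothetical and chosen only for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify step.

    Assumption: each of the k draft tokens is accepted independently with
    probability alpha, and the target model always contributes one extra
    token (a correction after the first rejection, or a bonus token if all
    k drafts are accepted). This gives the geometric sum
    1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def throughput(alpha: float, k: int, draft_latency: float, target_latency: float) -> float:
    """Tokens per second under the simplified model.

    One step costs k sequential draft forward passes plus one target
    forward pass; latencies are per forward pass, in seconds.
    """
    step_time = k * draft_latency + target_latency
    return expected_tokens_per_step(alpha, k) / step_time


if __name__ == "__main__":
    TARGET_LATENCY = 0.100  # hypothetical seconds per target forward pass
    K = 4                   # hypothetical number of draft tokens per step

    # Hypothetical draft models: one more accurate but slower,
    # one less accurate but faster (for example, wider and shallower).
    accurate_slow = throughput(alpha=0.80, k=K, draft_latency=0.020,
                               target_latency=TARGET_LATENCY)
    less_accurate_fast = throughput(alpha=0.75, k=K, draft_latency=0.008,
                                    target_latency=TARGET_LATENCY)
    baseline = 1.0 / TARGET_LATENCY  # plain autoregressive decoding

    print(f"baseline:            {baseline:.2f} tokens/s")
    print(f"accurate, slow:      {accurate_slow:.2f} tokens/s")
    print(f"less accurate, fast: {less_accurate_fast:.2f} tokens/s")
```

With these illustrative numbers, the faster draft model yields higher throughput even though fewer of its tokens are accepted, because the `k` sequential draft passes dominate the change in step time. Real acceptance rates are not independent across positions, so the model is only a rough guide to the trade-off.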