Abstract: Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving the draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10-45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
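
The draft-then-verify loop described in the abstract can be sketched as follows. This is a minimal illustration of the standard speculative-sampling acceptance rule, not the authors' code: the function name `verify_draft` and the list-of-lists probability format are assumptions made for this example, and a real system would obtain the target probabilities from a single parallel forward pass.

```python
import random

def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Accept or reject draft tokens (hypothetical helper, for illustration).

    draft_tokens: token ids proposed by the draft model
    q_probs[i]:   draft distribution at position i (list of floats)
    p_probs[i]:   target distribution at position i, with one extra entry
                  at the end for the bonus token
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft token was accepted: sample one bonus token from the target.
    last = p_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

The closer the draft distribution is to the target distribution, the more often the ratio test passes, so more tokens are emitted per target-model call. That is why draft and target alignment matters for speed.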

What problem does this paper attempt to address?

The paper aims to speed up text generation during large language model (LLM) inference through speculative decoding (SD), while keeping the generated text distributed according to the target model. In SD, a faster draft model proposes multiple tokens, which a larger target model then verifies in parallel, so the final text follows the target model's distribution. However, finding a draft model that is both compact and well aligned with the target model is a challenge. To this end, the authors propose DistillSpec, which uses knowledge distillation to better align the draft model with the target model, thereby improving the speed and efficiency of SD.

### Main problems

1. **Alignment problem between the draft model and the target model**: Existing SD methods lack an effective way to select and train a draft model that is closely aligned with the target model, which directly affects the efficiency of SD and the quality of the generated text.
2. **Performance improvement of SD**: How to further improve the speed and efficiency of SD without sacrificing the quality of the generated text.

### Solutions

1. **DistillSpec method**: DistillSpec uses knowledge distillation (KD) to improve the alignment between the draft model and the target model through two key design choices:
   - **Utilizing on-policy data generation**: Train on data generated by the draft model itself, rather than on a fixed dataset or on data generated by the teacher model.
   - **Customizing the divergence function**: Select a divergence function suited to the task and decoding strategy to optimize the distillation process.
2. **Experimental verification**: The authors verified the effectiveness of DistillSpec through a series of experiments covering speedups on different datasets, improvements in block efficiency, and reductions in actual latency. The results show that DistillSpec achieves a 10-45% speedup over standard SD on multiple tasks while maintaining the quality of the generated text.

### Experimental results

- **Speed improvement**: DistillSpec achieved a 10-45% speed improvement on multiple datasets, especially under greedy decoding.
- **Block efficiency**: Algorithms that distill on data generated by the model (such as f-Distill and GKD) achieve significantly better block efficiency than algorithms that use a fixed dataset (such as Supervised KD).
- **Consistency between theory and empirical evidence**: The theoretically calculated block efficiency closely matches the block efficiency measured in practice, supporting the effectiveness of the DistillSpec method.

### Conclusion

DistillSpec addresses the alignment problem between the draft model and the target model through knowledge distillation, significantly improves the speed and efficiency of SD, and provides a new solution for real-time inference of large language models.
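
To make the divergence choice and the notion of block efficiency more concrete, here is a small sketch under stated assumptions. The summary does not say which divergences are compared, so the four below (forward KL, reverse KL, Jensen-Shannon, total variation) are common candidates picked for illustration. The function names are hypothetical, and the block-efficiency formula assumes a constant, independent per-token acceptance rate `alpha` and a draft length `gamma`, which is a simplification.

```python
import math

def forward_kl(p, q):
    """KL(p || q). Assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(p, q):
    """KL(q || p): the same divergence with the arguments swapped."""
    return forward_kl(q, p)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric mix of two KL terms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * forward_kl(p, m) + 0.5 * forward_kl(q, m)

def tvd(p, q):
    """Total variation distance between target p and draft q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def acceptance_rate(p, q):
    """Probability that one draft token is accepted: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def expected_block_efficiency(alpha, gamma):
    """Expected tokens per target call, assuming i.i.d. acceptance rate alpha."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
p = [0.5, 0.3, 0.2]  # target
q = [0.4, 0.4, 0.2]  # draft
alpha = acceptance_rate(p, q)  # equals 1 - tvd(p, q)
print(alpha, expected_block_efficiency(alpha, gamma=5))
```

The per-token acceptance rate equals one minus the total variation distance, so lowering the divergence between draft and target raises the expected number of tokens produced per verification step.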

DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation

DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Online Speculative Decoding

Decoding Speculative Decoding

SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation

Improving Multi-candidate Speculative Decoding

SpecTr: Fast Speculative Decoding via Optimal Transport

Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

Parallel Speculative Decoding with Adaptive Draft Length

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding

Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

Graph-Structured Speculative Decoding

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Cascade Speculative Drafting for Even Faster LLM Inference

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation