DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal
2024-03-31
Abstract: Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving the draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10-45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper aims to speed up text generation during large language model (LLM) inference through speculative decoding (SD), while keeping the generated text distributed according to the target model. In SD, a faster draft model proposes multiple tokens, which the larger target model then verifies in parallel, so the final text follows the target model's distribution. The difficulty is finding a draft model that is both compact and well aligned with the target model. To address this, the authors propose DistillSpec, which uses knowledge distillation to align the draft model with the target model before SD is applied, thereby improving the speed and efficiency of SD.

### Main problems

1. **Alignment between the draft model and the target model**: Existing SD methods lack an effective way to select or train a draft model that is closely aligned with the target model. This directly affects the efficiency of SD (and, when lossy SD is used, the quality of the generated text).
2. **Performance improvement of SD**: How to further improve the speed and efficiency of SD without sacrificing the quality of the generated text.

### Solutions

1. **DistillSpec method**: DistillSpec uses knowledge distillation (KD) to improve the alignment between the draft model and the target model through two key design choices:
   - **On-policy data generation**: Train the draft model on data it generates itself, rather than on a fixed dataset or on data generated by the teacher (target) model.
   - **Tailoring the divergence function**: Choose the divergence function according to the task and decoding strategy.
2. **Experimental verification**: The authors evaluate DistillSpec in a series of experiments that measure speedups on different datasets, block efficiency (the number of tokens produced per target-model call), and actual latency. The results show that DistillSpec achieves 10-45% speedups over standard SD on multiple tasks while maintaining the quality of the generated text.

### Experimental results

- **Speed improvement**: DistillSpec achieves 10-45% speedups over standard SD on multiple datasets, with both greedy and non-greedy sampling.
- **Block efficiency**: Algorithms that distill on model-generated data (such as f-Distill and GKD) achieve significantly higher block efficiency than algorithms that use a fixed dataset (such as Supervised KD).
- **Consistency between theory and empirical evidence**: The theoretically calculated block efficiency closely matches the block efficiency measured in practice, which supports the analysis behind DistillSpec.

### Conclusion

DistillSpec addresses the alignment problem between the draft model and the target model through knowledge distillation, significantly improves the speed and efficiency of SD, and offers a practical route to lower-latency inference for large language models.
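To make the link between draft-target alignment and SD speed concrete, here is a minimal Python sketch. It is not taken from the summary above. It assumes the standard speculative sampling rule: a draft token `x` sampled from the draft distribution `q` is accepted with probability `min(1, p(x)/q(x))`, where `p` is the target distribution, and on rejection a token is resampled from the normalized residual `max(0, p - q)`. Under that rule the per-token acceptance rate equals one minus the total variation distance between `p` and `q`. The block-efficiency formula further assumes that each draft token is accepted independently with the same rate, which is a simplification. All function names and the toy distributions are hypothetical.

```python
import random


def acceptance_rate(p, q):
    """Probability that a token drawn from q is accepted: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p, q):
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def expected_block_efficiency(alpha, gamma):
    """Expected tokens per target-model call with gamma draft tokens.

    Assumes each draft token is accepted independently with probability alpha
    (a simplifying assumption made for this sketch).
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speculative_step(p, q, rng):
    """Verify one draft token. Returns (token, accepted)."""
    x = rng.choices(range(len(q)), weights=q)[0]  # draft model proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):      # target model accepts
        return x, True
    # Rejected: resample from the residual max(0, p - q); random.choices
    # normalizes the weights internally.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]            # toy target distribution p
    poor_draft = [0.2, 0.3, 0.5]        # poorly aligned draft q
    better_draft = [0.45, 0.35, 0.2]    # better aligned draft q
    for name, draft in [("poor", poor_draft), ("better", better_draft)]:
        alpha = acceptance_rate(target, draft)
        tau = expected_block_efficiency(alpha, gamma=4)
        print(f"{name}: TVD={total_variation(target, draft):.3f} "
              f"acceptance={alpha:.3f} block_efficiency={tau:.3f}")
```

In this toy setting, the better-aligned draft has a smaller total variation distance to the target, so more of its tokens are accepted and each target-model call yields more tokens. That is the quantity distillation of the draft model is meant to improve.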