Online Speculative Decoding

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
2024-06-10
Abstract: Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding to address this challenge. The main idea is to continuously update the (multiple) draft model(s) on observed user query data. Adapting to query distribution mitigates the shifts between the training distribution of the draft model and the query distribution, enabling the draft model to more accurately predict the target model's outputs. We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, bringing 1.42x to 2.17x latency reduction. Our code is available at https://github.com/LiuXiaoxuanPKU/OSD.
Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses high latency in large language model (LLM) inference. Speculative decoding accelerates inference by having a small draft model predict the output of a large target model, but its benefit is limited when the draft model predicts poorly. This happens especially on diverse text inputs and when the capability gap between the draft model and the target model is large.

To address this, the paper introduces Online Speculative Decoding (OSD). The main idea is to continuously update the draft model on observed user query data. Doing so narrows the gap between the draft model's training distribution and the query distribution, so the draft model predicts the target model's output more accurately. OSD thereby improves inference speed and reduces latency while keeping the draft model compact.

OSD achieves this through the following methods:

1. **Knowledge Distillation**: During speculative decoding, knowledge distillation is used to improve the alignment between the draft model and the target model. The draft model proposes candidate output tokens together with their probability distributions, and the target model evaluates these proposals and corrects the errors, which gives the draft model rich information to learn from.
2. **Dynamic Update of the Draft Model**: Based on the target model's corrections, the draft model is periodically fine-tuned to track the changing user query distribution.
3. **Query Routing**: Each query is routed to the draft model best suited to its query distribution, which improves prediction accuracy by letting each draft model focus on queries from a specific domain.
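The propose, verify, and correct loop described above can be sketched as follows. This is a minimal illustration, not the OSD implementation: it assumes the standard speculative-sampling rule (accept a draft token with probability min(1, p/q), otherwise resample from the residual distribution), and the names `verify_draft` and `collect_corrections`, the buffer layout, and the plain-list distributions are hypothetical choices made for the example.

```python
import random


def verify_draft(draft_tokens, draft_dists, target_dists, rng):
    """Check draft tokens against the target model's distributions.

    draft_dists[i] and target_dists[i] are probability lists over the
    vocabulary at position i. Returns (output_tokens, rejected_at), where
    rejected_at is None if every draft token was accepted.
    """
    output = []
    for i, tok in enumerate(draft_tokens):
        q = draft_dists[i][tok]   # draft probability of its own proposal
        p = target_dists[i][tok]  # target probability of the same token
        if rng.random() < min(1.0, p / q):
            output.append(tok)    # accepted: keep the draft token
            continue
        # Rejected: the target "corrects" the draft by sampling from the
        # residual distribution max(0, p - q), renormalized by rng.choices.
        residual = [max(0.0, pt - qd)
                    for pt, qd in zip(target_dists[i], draft_dists[i])]
        weights = residual if sum(residual) > 0 else target_dists[i]
        output.append(rng.choices(range(len(weights)), weights=weights)[0])
        return output, i
    # Full speculative decoding would also sample one extra token from the
    # target here; that step is omitted to keep the sketch short.
    return output, None


def collect_corrections(buffer, prefix, draft_tokens, draft_dists,
                        target_dists, rng):
    """Run verification and record the corrected position for distillation.

    The buffer format is an assumption: one record per rejection, holding
    the context and both distributions, for a later fine-tuning pass.
    """
    output, rejected_at = verify_draft(draft_tokens, draft_dists,
                                       target_dists, rng)
    if rejected_at is not None:
        buffer.append({
            "prefix": prefix + draft_tokens[:rejected_at],
            "draft_dist": draft_dists[rejected_at],
            "target_dist": target_dists[rejected_at],
        })
    return output
```

In a serving loop, a buffer like this would fill up as queries arrive, and a periodic fine-tuning step would then minimize a divergence between the stored draft and target distributions. How often to update and which divergence to use are design choices this sketch leaves open.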
Through these methods, OSD improves performance on both synthetic and real query data: the reported token acceptance rate rises by 0.1 to 0.65, which translates into a 1.42x to 2.17x latency reduction.
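To see why a higher token acceptance rate lowers latency, a common simplified model from the speculative decoding literature assumes each draft token is accepted independently with probability `alpha`. With `k` draft tokens per step, the expected number of tokens produced per target-model pass is then (1 - alpha^(k+1)) / (1 - alpha). The sketch below encodes that model. It is an illustration under this independence assumption, not necessarily the exact analysis used in the paper; the function names and `cost_ratio` (cost of one draft pass relative to one target pass) are hypothetical.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target-model pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha, and that the target always contributes one token
    (the correction, or a bonus token if everything is accepted).
    """
    if alpha >= 1.0:
        return float(k + 1)  # limit of the formula as alpha approaches 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, cost_ratio):
    """Speedup over plain decoding under the same simplified model.

    One step costs k draft passes plus one target pass, measured in units
    of a single target pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # Compare a weak and a better-aligned draft model at fixed k and cost.
    for alpha in (0.3, 0.6, 0.9):
        print(alpha, expected_speedup(alpha, k=4, cost_ratio=0.05))
```

Under this model the speedup grows monotonically with `alpha` for fixed `k` and `cost_ratio`, which is why the approach focuses on raising the acceptance rate rather than shrinking the draft model further.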