Abstract: Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding to address this challenge. The main idea is to continuously update the (multiple) draft model(s) on observed user query data. Adapting to query distribution mitigates the shifts between the training distribution of the draft model and the query distribution, enabling the draft model to more accurately predict the target model's outputs. We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, bringing 1.42x to 2.17x latency reduction. Our code is available at https://github.com/LiuXiaoxuanPKU/OSD.

What problem does this paper attempt to address?

The paper targets the high latency of large language model (LLM) inference. Speculative decoding already accelerates inference by having a small draft model predict the output of a large target model, but its benefit is limited by the draft model's low prediction accuracy, especially on diverse text inputs. A significant capability gap between the draft model and the target model lowers that accuracy further.

To address this, the paper introduces Online Speculative Decoding (OSD). The main idea is to continuously update the draft model on observed user queries. This reduces the mismatch between the draft model's training distribution and the query distribution, so the draft model predicts the target model's output more accurately. OSD thereby improves inference speed and reduces latency while keeping the draft model compact.

OSD relies on three mechanisms:

1. **Knowledge Distillation**: During speculative decoding, knowledge distillation is used to improve the alignment between the draft model and the target model. The draft model proposes candidate output tokens together with their probability distributions; the target model evaluates these proposals and corrects errors, and the draft model learns from this information.
2. **Dynamic Update of the Draft Model**: Using the target model's corrections, the draft model is periodically fine-tuned to follow the changing user query distribution.
3. **Query Routing**: Each query is routed to the draft model best suited to its query distribution, so that each draft model specializes in queries from a specific domain and predicts them more accurately.
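The first two mechanisms can be illustrated with a deliberately small sketch. It is not the authors' implementation: the "models" here are single context-free categorical distributions rather than LLMs, only one token is drafted per step, and query routing is omitted. The accept/resample rule in `verify` is the standard speculative-sampling rule, assumed here because the summary does not spell it out. The function names (`verify`, `distill_step`, `serve`) and the parameters `update_every` and `lr` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verify(token, p, q, rng):
    """Check one draft token against the target distribution.

    Assumed rule (standard speculative sampling): accept with probability
    min(1, p[token] / q[token]); on rejection, resample from the
    normalized residual max(0, p - q).
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return rng.choices(range(len(p)), weights=weights)[0], False


def distill_step(logits, target_batch, lr):
    """One gradient step on the forward KL(p || softmax(logits)).

    For softmax logits the gradient is (q - p), averaged over the batch.
    """
    q = softmax(logits)
    n = len(target_batch)
    grad = [sum(q[i] - p[i] for p in target_batch) / n
            for i in range(len(logits))]
    return [z - lr * g for z, g in zip(logits, grad)]


def serve(target_p, draft_logits, steps, update_every, lr, seed=0):
    """Toy online loop: draft, verify, buffer corrections, update periodically."""
    rng = random.Random(seed)
    buffer = []      # target distributions at positions where the draft was rejected
    accepted = []    # per-step acceptance flags
    for step in range(1, steps + 1):
        q = softmax(draft_logits)
        token = rng.choices(range(len(q)), weights=q)[0]  # draft proposes a token
        _, ok = verify(token, target_p, q, rng)           # target verifies it
        accepted.append(ok)
        if not ok:
            buffer.append(target_p)                       # keep the correction signal
        if step % update_every == 0 and buffer:
            draft_logits = distill_step(draft_logits, buffer, lr)
            buffer = []
    return draft_logits, accepted
```

Calling `serve([0.7, 0.2, 0.1], [0.0, 0.0, 0.0], steps=4000, update_every=50, lr=1.0)` starts from a uniform draft distribution and returns the updated logits plus the acceptance flags, so the acceptance rate over early and late windows can be compared. In a real system the buffer would hold per-position target distributions conditioned on the prompt, and the update would be a fine-tuning step on the draft LLM.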
Through these methods, OSD increases the token acceptance rate on multiple datasets, which in turn reduces inference latency.
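The link between acceptance rate and latency can be made concrete with the usual back-of-the-envelope model from the speculative decoding literature. The sketch below assumes each draft token is accepted independently with probability `alpha`, that `k` tokens are drafted per step, and that one draft forward pass costs a fraction `c` of a target forward pass. These symbols and function names are illustrative assumptions, not values or code from the paper, and the output is an idealized estimate rather than the paper's measured speedup.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per target forward pass.

    Under the i.i.d. acceptance assumption this is the geometric sum
    1 + alpha + ... + alpha**k = (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha, k, c):
    """Idealized wall-clock speedup over plain autoregressive decoding.

    One step costs k draft passes (k * c) plus one target pass (1),
    measured in units of a single target forward pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * c + 1)
```

Because `expected_tokens_per_step` grows monotonically with `alpha`, any increase in acceptance rate from online updates translates into more tokens per target pass at the same drafting cost, which is the effect the summary describes.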

Online Speculative Decoding
