Abstract: Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency, Multi-Candidate Speculative Decoding (MCSD) improves upon this by sampling multiple candidate tokens from the draft model at each step and verifying them in parallel, thus increasing the chances of accepting a token and reducing generation time. Existing MCSD methods rely on the draft model to initialize the multi-candidate sequences and use static length and tree attention structure for draft generation. However, such an approach suffers from the draft and target model's output distribution differences, especially in a dynamic generation context. In this work, we introduce a new version of MCSD that includes a target model initialized multi-candidate generation, a dynamic sliced topology-aware causal mask for dynamic length adjustment, and decision models to optimize early stopping. We experimented with our method on Llama 2-7B and its variants and observed a maximum 27.5% speedup compared to our MCSD baseline across three benchmarks with Llama 2-7B as the target model and JackFram 68M as the draft model. Additionally, we evaluate the effects of using the target model initialized multi-candidate process with different draft models on output quality.

What problem does this paper attempt to address?

This paper aims to improve the efficiency of Multi-Candidate Speculative Decoding (MCSD) for large language model (LLM) inference. The authors address three limitations of existing MCSD methods:

1. **Target-model-initialized multi-candidate generation**: Existing MCSD methods rely on the draft model to generate the entire multi-candidate token tree, and then sample only one token for verification, either from the target model or from the normalized output distributions of the target and draft models. Because the output distributions of the draft and target models differ, this can lead to a low acceptance rate. The authors instead use the target model to generate multiple tokens that initialize the multi-candidate sequences, which raises the acceptance rate.
2. **Dynamically sliced topology-aware causal mask**: Existing MCSD methods usually construct the topology-aware causal mask only once, at initialization, which fixes the draft length and limits adaptability. The authors introduce a dynamically sliced topology-aware causal mask, which lets a decision model choose the length of multi-candidate draft generation at run time without regenerating the mask in each iteration.
3. **Early-stopping decision model**: To further optimize generation, the authors design a low-complexity MLP as a decision model that determines whether draft generation should stop early. The model predicts the probability that the target model will accept the drafted tokens, based on the hidden state of the input sequence or other features, and terminates drafting early when that probability is low.
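The slicing idea in the second improvement can be illustrated with a small sketch. It is not the authors' implementation: it assumes the token tree is flattened level by level, so that a shorter draft corresponds to the top-left block of a mask built once for the maximum depth. The function names (`build_tree_parents`, `build_mask`, `slice_mask`) and the branching configuration `[2, 2, 1]` are hypothetical.

```python
def build_tree_parents(branching):
    """Lay out a token tree level by level and return (parents, level_sizes).

    branching[d] is the number of children given to every node of the
    previous level (hypothetical configuration). First-level candidates
    hang off a virtual root and get parent -1.
    """
    parents, level_sizes = [], []
    prev_level = [-1]  # virtual root: the last accepted token
    for k in branching:
        cur_level = []
        for p in prev_level:
            for _ in range(k):
                cur_level.append(len(parents))  # index of the new node
                parents.append(p)
        level_sizes.append(len(cur_level))
        prev_level = cur_level
    return parents, level_sizes


def build_mask(parents):
    """mask[i][j] is True when node i may attend to node j,
    i.e. j is node i itself or one of its ancestors in the tree."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def slice_mask(mask, level_sizes, depth):
    """Reuse the full mask for a shorter draft: keep only the nodes of the
    first `depth` levels. Valid because nodes are stored level by level,
    so those nodes occupy the top-left block of the mask."""
    n = sum(level_sizes[:depth])
    return [row[:n] for row in mask[:n]]


# Build the mask once for the maximum depth, then slice it per iteration.
parents, level_sizes = build_tree_parents([2, 2, 1])
full_mask = build_mask(parents)
short_mask = slice_mask(full_mask, level_sizes, depth=2)
```

Under this layout, slicing is a cheap view-like operation compared with rebuilding the ancestor structure, which is the motivation the summary gives for avoiding mask regeneration in each iteration.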
Through these improvements, the authors aim to speed up MCSD inference while maintaining generation quality. With Llama 2-7B as the target model and JackFram 68M as the draft model, their method achieves a maximum speedup of 27.5% over their MCSD baseline across three benchmarks. The authors also evaluate how different draft models affect output quality when the target-model-initialized multi-candidate process is used, and conduct an ablation study to analyze the contribution of each component. In summary, the paper's main contribution is a set of techniques that improve MCSD and accelerate LLM inference while aiming to preserve generation quality.
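The early-stopping decision model can be sketched in a similar spirit. The example below is only an illustration under stated assumptions: a one-hidden-layer MLP with hand-picked weights, a two-dimensional feature vector per draft step, and a fixed threshold of 0.5. The paper's actual architecture, input features, and stopping rule are not specified in this summary, and the names `mlp_accept_prob` and `draft_with_early_stop` are hypothetical.

```python
import math


def mlp_accept_prob(features, w1, b1, w2, b2):
    """Tiny MLP (ReLU hidden layer, sigmoid output) that maps per-step
    features to a predicted acceptance probability."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def draft_with_early_stop(step_features, params, threshold=0.5):
    """Return how many draft steps are kept before the decision model
    predicts an acceptance probability below `threshold`."""
    kept = 0
    for feats in step_features:
        if mlp_accept_prob(feats, *params) < threshold:
            break  # stop drafting and hand over to target verification
        kept += 1
    return kept


# Hypothetical hand-picked weights; a real decision model would be trained.
params = (
    [[1.0, 0.0], [0.0, 1.0]],  # w1: hidden layer weights
    [0.0, 0.0],                # b1: hidden layer biases
    [2.0, -2.0],               # w2: output weights
    0.0,                       # b2: output bias
)
# One illustrative feature vector per draft step.
steps = [[0.9, 0.1], [0.6, 0.5], [0.3, 0.9], [0.9, 0.1]]
kept_steps = draft_with_early_stop(steps, params)
```

The threshold trades off wasted draft computation against missed opportunities: a higher value stops drafting sooner, a lower value drafts longer sequences that are more likely to be rejected.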

Improving Multi-candidate Speculative Decoding

Multi-Candidate Speculative Decoding

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens

EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models

Graph-Structured Speculative Decoding

SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Decoding Speculative Decoding

Cascade Speculative Drafting for Even Faster LLM Inference

AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Online Speculative Decoding

AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Parallel Speculative Decoding with Adaptive Draft Length

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree

Mixture of Attentions For Speculative Decoding