Abstract: Introduced to enhance the efficiency of large language model (LLM) inference, speculative decoding operates by having a smaller model generate a draft. A larger target model then reviews this draft to align it with its own output, and every draft token the target model accepts reduces the number of target model runs, ultimately improving efficiency. However, the drafting process in speculative decoding includes slow autoregressive generation and allocates equal time to generating tokens, irrespective of their importance. These inefficiencies collectively contribute to the suboptimal performance of speculative decoding. To further improve LLM inference, we introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal Cascade optimizes time allocation in drafting for improved efficiency. Combining both cascades, CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments, while maintaining the same output distribution as the target model. Our code is publicly available at https://github.com/lfsszd/CS-Drafting.

What problem does this paper attempt to address?

This paper addresses the high latency of inference in large language models (LLMs). As LLMs grow in scale, autoregressive decoding, which produces output one token at a time, makes inference very time-consuming, especially in long-text generation tasks. Existing speculative decoding mitigates this by having a small model generate a draft that the large target model then reviews, which reduces the number of target model runs and thereby improves efficiency. However, drafting still has two weaknesses: the draft itself is produced by slow autoregressive generation, and every draft token is given the same amount of generation time regardless of its importance. Together, these factors leave speculative decoding with suboptimal performance.

To further improve LLM inference efficiency, the paper introduces a new algorithm, Cascade Speculative Drafting (CS Drafting). The algorithm optimizes speculative decoding through two mechanisms: the Vertical Cascade and the Horizontal Cascade. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal Cascade optimizes how time is allocated across draft tokens to improve overall efficiency. Combining the two, CS Drafting achieves up to 81% additional speedup over standard speculative decoding in the paper's experiments while maintaining the same output distribution as the target model.

The main contributions of the paper are:
1. Introducing the CS Drafting algorithm, which improves the inference speed of language models without sacrificing generation quality.
2. Providing a theoretical analysis that supports the effectiveness of the proposed CS Drafting method.
3. Demonstrating experimentally that CS Drafting achieves further speedup over standard speculative decoding across different tasks and settings.
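The draft-then-verify loop that CS Drafting builds on can be sketched in a few lines. The following is a minimal toy illustration of baseline speculative decoding in its greedy (temperature 0) form, not the paper's CS Drafting algorithm and not code from the linked repository. The names `target_next`, `draft_next`, `speculative_greedy`, and `target_only_greedy` are hypothetical, and the two "models" are stand-in arithmetic functions chosen so that the draft agrees with the target most of the time.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (3 * prefix[-1] + len(prefix)) % 11


def draft_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the small draft model.

    It matches the target except when the prefix length is a multiple of 4.
    """
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 11


def target_only_greedy(prompt: List[int], n_new: int,
                       target: NextToken) -> Tuple[List[int], int]:
    """Plain autoregressive decoding: one target run per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens, n_new


def speculative_greedy(prompt: List[int], n_new: int, k: int,
                       draft: NextToken,
                       target: NextToken) -> Tuple[List[int], int]:
    """Greedy speculative decoding with draft length k.

    Returns the generated sequence and the number of target verification runs.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_runs = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. One target run checks every proposed position. A real model
        #    scores all positions in a single parallel forward pass; the
        #    loop below only simulates that pass.
        target_runs += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
        else:
            # Every draft token was accepted, so the same target run
            # also yields one extra token for free.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_runs


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference, reference_runs = target_only_greedy(prompt, 20, target_next)
    output, runs = speculative_greedy(prompt, 20, 4, draft_next, target_next)
    print("same output as target-only decoding:", output == reference)
    print("target runs:", runs, "vs", reference_runs)
```

Because every emitted token is either a draft token the target agrees with or the target's own correction, the output matches target-only greedy decoding while using fewer target runs. With sampling instead of greedy decoding, the equality check is replaced by a probabilistic accept/reject rule so that the target's output distribution is preserved; that rule is omitted here for brevity. Note that the inner drafting loop is itself autoregressive, which is the cost the summary above says the Vertical Cascade is designed to remove.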

Cascade Speculative Drafting for Even Faster LLM Inference

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Graph-Structured Speculative Decoding

Decoding Speculative Decoding

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Faster Cascades via Speculative Decoding

Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Improving Multi-candidate Speculative Decoding

AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Online Speculative Decoding

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Parallel Speculative Decoding with Adaptive Draft Length

Dynamic Depth Decoding: Faster Speculative Decoding for LLMs

KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning

Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism

SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens

SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput