Abstract: Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes inference slow and costly. Speculative decoding has emerged as a promising solution: a smaller auxiliary model drafts future tokens, which the larger model then validates simultaneously, achieving a speed-up of 1-2x. Although speculative decoding preserves the output distribution of multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. Doing so raises four key challenges: (1) how to generate multiple sequences from the larger model's distribution given draft sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism that dynamically tunes the number of beams based on the context, optimizing efficiency and effectiveness. In addition, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling...
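
As background for the draft-then-verify idea mentioned in the abstract, the following is a minimal sketch of the standard single-token speculative sampling rule, which is the reason ordinary speculative decoding preserves the large model's sampling distribution. It is not the DSBD scheme proposed in the paper. The function name `speculative_step` and the toy three-token distributions are illustrative assumptions, and a real system would obtain both distributions from model forward passes.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """One draft-then-verify step over a toy vocabulary (illustrative only).

    p_target: probabilities from the large (target) model.
    q_draft:  probabilities from the small (draft) model, same vocabulary.
    Returns the index of the emitted token.
    """
    vocab = range(len(p_target))
    # Draft: the small model proposes a token from its own distribution.
    x = rng.choices(vocab, weights=q_draft)[0]
    # Verify: accept the draft with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    # Reject: resample from the residual distribution max(p - q, 0),
    # which random.choices normalizes implicitly via the weights.
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    return rng.choices(vocab, weights=residual)[0]


def empirical_distribution(p_target, q_draft, n, seed=0):
    """Run n independent steps and return the observed token frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        counts[speculative_step(p_target, q_draft, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # assumed target-model distribution
    q = [0.2, 0.5, 0.3]  # assumed draft-model distribution
    print(empirical_distribution(p, q, n=20000))
```

Over many trials the observed frequencies approach `p_target` rather than `q_draft`, even though every proposal comes from the draft model. Extending this guarantee from a single sampled sequence to several beams is the first of the four challenges the abstract lists.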

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

SpecPIM: Accelerating Speculative Inference on PIM-Enabled System Via Architecture-Dataflow Co-Exploration

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

SPEED: Speculative Pipelined Execution for Efficient Decoding

Poster: PipeLLM: Pipeline LLM Inference on Heterogeneous Devices with Sequence Slicing

Accelerating LLM Inference with Staged Speculative Decoding

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption

Speculative Streaming: Fast LLM Inference without Auxiliary Models

Minions: Accelerating Large Language Model Inference with Aggregated Speculative Execution

Distributed Speculative Inference of Large Language Models

Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

Cascade Speculative Drafting for Even Faster LLM Inference

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

PLD+: Accelerating LLM inference by leveraging Language Model Artifacts

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference