Abstract: Large Language Models (LLMs) employ auto-regressive decoding, which requires sequential computation because each step depends on the output of the previous one. This creates a bottleneck, since every step requires moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. Methods such as speculative decoding have been proposed to address this issue, but their adoption is impeded by the difficulty of acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa substantially reduces the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases. In Medusa-1, Medusa is fine-tuned directly on top of a frozen backbone LLM, enabling lossless inference acceleration. In Medusa-2, Medusa is fine-tuned together with the backbone LLM, which improves the prediction accuracy of the Medusa heads and yields a higher speedup, but requires a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.
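The propose-and-verify idea described in the abstract can be illustrated with a toy sketch. This is not the Medusa implementation: `backbone_next` and `heads_propose` are hypothetical stand-ins for the backbone LLM and the extra decoding heads, the vocabulary size of 50 is arbitrary, and the sketch verifies a single candidate chain with exact greedy matching rather than a tree of candidates with tree-based attention or typical acceptance. It only shows why accepting several verified tokens per step reduces the number of decoding steps while leaving the greedy output unchanged.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # arbitrary toy vocabulary size (assumption)


def backbone_next(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for the backbone LLM's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def heads_propose(prefix: List[Token], num_heads: int) -> List[Token]:
    """Hypothetical stand-in for the extra heads: guess the next num_heads tokens.

    To simulate imperfect heads, the guess is deliberately wrong whenever the
    position index is a multiple of 3. Peeking at the backbone here is only a
    way to control accuracy in the toy; real heads do not do this.
    """
    guesses: List[Token] = []
    ctx = list(prefix)
    for _ in range(num_heads):
        true_tok = backbone_next(ctx)
        wrong = len(ctx) % 3 == 0
        guesses.append((true_tok + 1) % VOCAB if wrong else true_tok)
        ctx.append(true_tok)
    return guesses


def decode_greedy(prompt: List[Token], max_new: int) -> Tuple[List[Token], int]:
    """Plain auto-regressive decoding: one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(backbone_next(out))
    return out, max_new


def decode_with_heads(
    prompt: List[Token], max_new: int, num_heads: int
) -> Tuple[List[Token], int]:
    """Each step verifies the heads' guesses and keeps the longest correct prefix."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        steps += 1  # counted as one backbone pass that checks the whole candidate
        for guess in heads_propose(out, num_heads):
            if guess != backbone_next(out):
                break  # first mismatch: discard this guess and all later ones
            out.append(guess)
        out.append(backbone_next(out))  # the backbone's own token is always kept
    return out[: len(prompt) + max_new], steps


slow_out, slow_steps = decode_greedy([1, 2, 3], 20)
fast_out, fast_steps = decode_with_heads([1, 2, 3], 20, num_heads=3)
print(fast_out == slow_out, slow_steps, fast_steps)
```

Because every accepted token equals the backbone's greedy choice at that position, the two decoders produce the same sequence; only the step count differs, and it depends on how often the simulated heads guess correctly.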

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

CLLMs: Consistency Large Language Models

AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration

LLMCad: Fast and Scalable On-device Large Language Model Inference

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

On Speculative Decoding for Multimodal Large Language Models

Graph-Structured Speculative Decoding

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Faster Speech-LLaMA Inference with Multi-token Prediction