Abstract: Large language models (LLMs) have achieved remarkable success across diverse tasks, yet their inference is hindered by substantial time and energy demands because only a single token is generated at each decoding step. Previous methods such as speculative decoding mitigate these inefficiencies by producing multiple tokens per step, but each token is still drawn from its single-token distribution, so they improve speed without improving effectiveness. In contrast, our work enhances inference speed and improves output effectiveness at the same time. We consider multi-token joint decoding (MTJD), which generates multiple tokens from their joint distribution at each iteration, theoretically reducing perplexity and enhancing task performance. However, sampling from the joint distribution of multiple tokens is expensive. Inspired by speculative decoding, we introduce multi-token assisted decoding (MTAD), a novel framework designed to accelerate MTJD. MTAD uses a smaller auxiliary model to approximate the joint distribution of a larger model and incorporates a verification mechanism that not only ensures the accuracy of this approximation, but also improves decoding efficiency over conventional speculative decoding. Theoretically, we demonstrate that MTAD closely approximates exact MTJD with bounded error. Empirical evaluations using Llama-2 and OPT models ranging from 13B to 70B parameters across various tasks show that MTAD reduces perplexity by 21.2% and improves downstream performance compared to standard single-token sampling. Furthermore, MTAD achieves a 1.42x speed-up and consumes 1.54x less energy than conventional speculative decoding methods. These results highlight MTAD's ability to make multi-token joint decoding both effective and efficient, promoting more sustainable and high-performance deployment of LLMs.
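
To illustrate why decoding from a joint distribution can lower perplexity relative to token-by-token decoding, the following is a minimal toy sketch, not the MTAD algorithm itself. The two-step distribution (`P_FIRST`, `P_SECOND`), the token names, and the function names are all hypothetical values chosen for illustration, and the exhaustive search over pairs stands in for whatever joint-decoding procedure is actually used.

```python
import itertools
import math

# Hypothetical toy model (not from the paper): P(first) and P(second | first).
P_FIRST = {"a": 0.6, "b": 0.4}
P_SECOND = {
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.90, "y": 0.10},
}


def joint_prob(first, second):
    """Probability of the two-token sequence under the toy model."""
    return P_FIRST[first] * P_SECOND[first][second]


def greedy_single_token():
    """Pick each token by maximizing its own single-token distribution."""
    first = max(P_FIRST, key=P_FIRST.get)
    second = max(P_SECOND[first], key=P_SECOND[first].get)
    return first, second


def greedy_joint():
    """Pick the pair that maximizes the joint probability of both tokens."""
    pairs = itertools.product(P_FIRST, ["x", "y"])
    return max(pairs, key=lambda pair: joint_prob(*pair))


def perplexity(seq):
    """Per-token perplexity of a two-token sequence under the toy model."""
    return math.exp(-math.log(joint_prob(*seq)) / len(seq))


single = greedy_single_token()
joint = greedy_joint()
print("single-token:", single, "perplexity", round(perplexity(single), 4))
print("joint:       ", joint, "perplexity", round(perplexity(joint), 4))
```

In this toy setting the token-by-token choice commits to the locally most likely first token, while the joint choice accepts a less likely first token because the pair as a whole is more probable, which yields a lower perplexity. Exhaustive enumeration is only feasible here because the vocabulary has two tokens; the cost of doing this at scale is the problem the abstract says MTAD is designed to address.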

On Speculative Decoding for Multimodal Large Language Models

Decoding Speculative Decoding

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Graph-Structured Speculative Decoding

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Accelerating LLM Inference with Staged Speculative Decoding

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Optimized Multi-Token Joint Decoding with Auxiliary Model for LLM Inference

Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs

Speculative Contrastive Decoding

AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration

Multi-Candidate Speculative Decoding

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Online Speculative Decoding

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens

EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models