Abstract: Large language models (LLMs) have achieved remarkable success across diverse tasks, yet their inference is hindered by substantial time and energy demands because only a single token is generated at each decoding step. Previous methods such as speculative decoding mitigate this inefficiency by producing multiple tokens per step, but each token is still drawn from its single-token distribution, so they improve speed without improving effectiveness. In contrast, our work improves inference speed and output effectiveness at the same time. We consider multi-token joint decoding (MTJD), which generates multiple tokens from their joint distribution at each iteration, theoretically reducing perplexity and enhancing task performance. However, sampling from the joint distribution of multiple tokens is costly. Inspired by speculative decoding, we introduce multi-token assisted decoding (MTAD), a novel framework designed to accelerate MTJD. MTAD uses a smaller auxiliary model to approximate the joint distribution of a larger model, together with a verification mechanism that not only ensures the accuracy of this approximation but also improves decoding efficiency over conventional speculative decoding. Theoretically, we demonstrate that MTAD closely approximates exact MTJD with bounded error. Empirical evaluations using Llama-2 and OPT models ranging from 13B to 70B parameters across various tasks show that MTAD reduces perplexity by 21.2% and improves downstream performance compared to standard single-token sampling. Furthermore, MTAD achieves a 1.42x speed-up and consumes 1.54x less energy than conventional speculative decoding methods. These results highlight MTAD's ability to make multi-token joint decoding both effective and efficient, promoting more sustainable and high-performance deployment of LLMs.
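
The claim that decoding from the joint distribution can lower perplexity can be illustrated with a toy two-token example. The sketch below is not the MTAD algorithm: the distributions `P_FIRST` and `P_SECOND` are hypothetical numbers chosen for illustration, and it uses argmax selection rather than sampling so that the outcome is deterministic. It only shows that picking each token by its own single-token distribution can miss the two-token sequence with the highest joint probability.

```python
import itertools
import math

# Hypothetical toy model (not from the paper): a first-token distribution
# and a second-token distribution conditioned on the first token.
P_FIRST = {"A": 0.6, "B": 0.4}
P_SECOND = {
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"x": 0.9, "y": 0.1, "z": 0.0},
}


def joint_prob(first, second):
    """Joint probability of a two-token sequence under the toy model."""
    return P_FIRST[first] * P_SECOND[first][second]


def decode_single_token():
    """Pick each token by its own single-token distribution, one step at a time."""
    first = max(P_FIRST, key=P_FIRST.get)
    second = max(P_SECOND[first], key=P_SECOND[first].get)
    return first, second


def decode_joint():
    """Pick the two-token sequence that maximizes the joint probability."""
    pairs = itertools.product(P_FIRST, P_SECOND["A"])
    return max(pairs, key=lambda pair: joint_prob(*pair))


def perplexity(seq):
    """Per-token perplexity of a two-token sequence under the toy model."""
    return math.exp(-math.log(joint_prob(*seq)) / len(seq))


if __name__ == "__main__":
    for name, seq in [("single-token", decode_single_token()),
                      ("joint", decode_joint())]:
        print(name, seq, round(perplexity(seq), 3))
```

In this toy setting the step-by-step choice commits to the locally most likely first token, while the joint choice accepts a less likely first token because it leads to a much more likely continuation. Enumerating all sequences, as `decode_joint` does, grows exponentially with the number of tokens, which is the cost that motivates approximating the joint distribution with an auxiliary model.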
