Abstract: Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes inference slow and costly. Speculative decoding has emerged as a promising solution: a smaller auxiliary model drafts future tokens, which the larger model then validates simultaneously, achieving a speed-up of 1-2x. Although speculative decoding preserves the output distribution of multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. This integration raises four key challenges: (1) how to generate multiple sequences from the larger model's distribution given draft sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism that dynamically tunes the number of beams based on the context, optimizing efficiency and effectiveness. In addition, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm that mitigates the memory overhead of beam sampling...

The Implicit Length Bias of Label Smoothing on Beam Search Decoding

Language-Informed Beam Search Decoding for Multilingual Machine Translation

Focus on the Target's Vocabulary: Masked Label Smoothing for Machine Translation

Data Noising as Smoothing in Neural Network Language Models

Investigating Label Bias in Beam Search for Open-ended Text Generation

Length bias in Encoder Decoder Models and a Case for Global Conditioning

DC-MBR: Distributional Cooling for Minimum Bayesian Risk Decoding

Addressing the Length Bias Problem in Document-Level Neural Machine Translation

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Loose Lips Sink Ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

Learning label smoothing for text classification

Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation

On the Inference Calibration of Neural Machine Translation

Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Label Smoothing is Robustification against Model Misspecification

Controlling the Output Length of Neural Machine Translation

The Role of $n$-gram Smoothing in the Age of Neural Networks

Cross Entropy versus Label Smoothing: A Neural Collapse Perspective

Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks

When to Finish? Optimal Beam Search for Neural Text Generation (modulo beam size)

The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration