Abstract: Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature: they process tokens one at a time, which slows generation compared to models like GANs or diffusion-based methods that operate more efficiently. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward pass, its application in visual AR models remains largely unexplored. In this work, we identify a challenge in this setting, which we term \textit{token selection ambiguity}, wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, we propose a relaxed acceptance condition, referred to as LANTERN, that leverages the interchangeability of tokens in latent space. This relaxation restores the effectiveness of speculative decoding in visual AR models by enabling more flexible use of candidate tokens that would otherwise be prematurely rejected. Furthermore, by incorporating a total variation distance bound, we ensure that these speed gains are achieved without significantly compromising image quality or semantic coherence. Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. Specifically, compared to a naïve application of the state-of-the-art speculative decoding, LANTERN increases speed-ups by $\mathbf{1.75}\times$ and $\mathbf{1.76}\times$, as compared to greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model.
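
As a rough illustration of why spread-out next-token distributions hurt speculative decoding, the sketch below applies the standard speculative sampling acceptance rule (accept a drafted token x with probability min(1, q(x)/p(x)), where p is the drafter distribution and q the target distribution) to two toy cases. The distributions, the vocabulary size, and the function names are made up for illustration and are not taken from the paper; the "diffuse" case additionally assumes that the drafter and the target spread their mass over partly different tokens.

```python
def acceptance_prob(p_draft, q_target, token):
    """Standard speculative sampling test: accept with probability min(1, q/p)."""
    return min(1.0, q_target[token] / p_draft[token])


def expected_acceptance(p_draft, q_target):
    """Acceptance rate averaged over drafts x ~ p, which equals sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(p_draft, q_target))


# Toy "language-like" case: both models concentrate mass on the same token.
p_peaked = [0.90, 0.04, 0.03, 0.01, 0.01, 0.01]
q_peaked = [0.85, 0.06, 0.04, 0.02, 0.02, 0.01]

# Toy "visual-like" case (assumption): mass is spread out and the two models
# favor different tokens, so most drafted tokens have a low q/p ratio.
p_diffuse = [0.30, 0.25, 0.20, 0.15, 0.05, 0.05]
q_diffuse = [0.05, 0.15, 0.20, 0.25, 0.30, 0.05]

rate_peaked = expected_acceptance(p_peaked, q_peaked)
rate_diffuse = expected_acceptance(p_diffuse, q_diffuse)
print(f"peaked:  {rate_peaked:.2f}")
print(f"diffuse: {rate_diffuse:.2f}")
```

In this toy setup the expected acceptance rate is noticeably lower in the diffuse case, which is the kind of drop the abstract attributes to token selection ambiguity.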

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses the efficiency bottleneck of Auto-Regressive (AR) models in image generation. Although AR models perform well on image-generation tasks, they generate tokens sequentially, which makes generation relatively slow. The gap is most visible next to parallel generation methods such as Generative Adversarial Networks (GANs) and diffusion models.

To accelerate AR generation, researchers have proposed speculative decoding, which improves generation speed by predicting multiple tokens and validating them all at once. However, existing speculative decoding methods work poorly on visual AR models. The main reason is a problem the paper calls "token selection ambiguity": when a visual AR model predicts the next token, it often assigns relatively uniform, low probabilities to many tokens, so the candidate tokens in speculative decoding are rarely accepted and the speed-up shrinks.

The paper therefore proposes a new method, LANTERN (Latent Neighbor Token Acceptance Relaxation), which addresses token selection ambiguity by relaxing the acceptance condition of speculative decoding and exploiting the interchangeability of tokens in the latent space. LANTERN significantly improves the acceptance rate of speculative decoding and achieves a significant speed-up without substantially sacrificing image quality or semantic coherence.

### Main contributions

1. **Identify and define the token selection ambiguity problem**: This is a key problem that hinders the effective application of speculative decoding in visual AR models.
2. **Propose the LANTERN method**: By relaxing the acceptance condition of speculative decoding and exploiting the interchangeability of tokens in the latent space, LANTERN addresses the token selection ambiguity problem.
3. **Verify the effectiveness of LANTERN through experiments**: Experimental results on the LlamaGen model show that LANTERN can significantly improve generation speed while maintaining image quality.

### Specific technical details

#### 1. Token selection ambiguity problem

- **Language models vs visual AR models**:
  - Tokens in language models represent discrete words or sub-words and follow a structured, predictable sequence, so the probability distribution of the next token is usually more concentrated.
  - Visual AR models deal with pixels or image blocks. These tokens form a continuous and highly complex space, which leads to a more dispersed next-token distribution and more uncertainty in token selection.
- **Empirical analysis**:
  - Experiments on LlamaGen and Vicuna-7B show that visual AR models often assign low probabilities to many tokens when predicting the next token, while language models predict the next token more accurately.

#### 2. LANTERN method

- **Latent-space proximity**:
  - LANTERN exploits latent-space proximity in visual AR models: tokens that are close in the latent space can be interchanged without significantly affecting the visual semantics of the generated image.
  - This has been verified experimentally. When each sampled token is resampled from its nearest 100 tokens, the generated image is very similar to the image generated by the original method.
- **Relax the acceptance conditions**:
  - The original acceptance condition is based on the probability alignment of the drafter and target models, but under token selection ambiguity the acceptance rate drops significantly.
  - LANTERN increases the acceptance rate by aggregating the probabilities of the nearest-neighbor tokens of each candidate token, thereby alleviating the token selection ambiguity problem.
- **Control distribution deviation**:
  - By introducing the total variation distance (
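
The relaxed acceptance idea described above can be sketched as follows. This is only one plausible reading of the summary, not the paper's implementation: the candidate token is accepted using the target probability mass summed over its nearest neighbors in latent space, instead of the target probability of that single token. The 2-D codebook, the Euclidean distance, the neighborhood size `k`, and all function names are assumptions made for illustration. The total variation distance control is left out because the description of it is cut off in the text.

```python
import math


def nearest_neighbors(codebook, token, k):
    """Indices of the k codebook entries closest to `token` (Euclidean), itself included."""
    ranked = sorted(
        (math.dist(codebook[token], vec), idx) for idx, vec in enumerate(codebook)
    )
    return [idx for _, idx in ranked[:k]]


def strict_acceptance_prob(p_draft, q_target, token):
    """Original test: min(1, q(token) / p(token))."""
    return min(1.0, q_target[token] / p_draft[token])


def relaxed_acceptance_prob(p_draft, q_target, token, neighbors):
    """Relaxed test (assumed form): target mass aggregated over the neighborhood."""
    q_mass = sum(q_target[i] for i in neighbors)
    return min(1.0, q_mass / p_draft[token])


# Hypothetical 2-D latent codebook: tokens 0-2 are close to each other.
codebook = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
p_draft = [0.30, 0.25, 0.20, 0.15, 0.05, 0.05]   # toy drafter distribution
q_target = [0.05, 0.15, 0.20, 0.25, 0.30, 0.05]  # toy target distribution

candidate = 0
neighbors = nearest_neighbors(codebook, candidate, k=2)
strict = strict_acceptance_prob(p_draft, q_target, candidate)
relaxed = relaxed_acceptance_prob(p_draft, q_target, candidate, neighbors)
print(f"neighbors={neighbors} strict={strict:.3f} relaxed={relaxed:.3f}")
```

Because the neighborhood always contains the candidate itself, the relaxed probability in this sketch is never lower than the strict one. A larger `k` accepts more drafts but moves the output further from the target distribution, which is the deviation the total variation distance term is meant to control.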

LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding
