Abstract: Large language model (LLM) decoding generates a sequence of tokens from a given context, predicting each token one at a time from the model's learned probabilities. Standard autoregressive decoding requires a separate forward pass through the model for every generated token, which is computationally inefficient and makes LLMs hard to deploy in latency-sensitive scenarios. Existing acceleration approaches either require fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across models and contexts. To address these issues, we introduce ADED, a novel methodology that accelerates LLM decoding without requiring fine-tuning. Our approach uses an adaptive draft-verification process that evolves over time to improve efficiency. We use a tri-gram matrix-based representation of the LLM to dynamically approximate its output distribution, so that the approximation can adjust to changing token probabilities during decoding. We also implement a draft construction mechanism that balances exploration and exploitation, ensuring that the generated drafts are both diverse and close to the true output distribution of the LLM. This design matters because it optimizes the draft distribution adaptively, leading to faster and more accurate decoding. Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that ADED significantly accelerates decoding while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications.
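As a rough illustration of the draft-verification idea described in the abstract, the following minimal Python sketch keeps a tri-gram count table that is updated from verified output and used to propose drafts. It is not ADED's implementation. All names (`TrigramDrafter`, `decode`, `toy_target`) are hypothetical, the "LLM" is a toy deterministic function, verification is simple greedy exact-match, and the exploration/exploitation draft construction mentioned in the abstract is omitted. In a real system the whole draft would be verified in one batched forward pass; here that is only simulated by counting one pass per verification step.

```python
from collections import Counter, defaultdict


class TrigramDrafter:
    """Hypothetical tri-gram table: (prev2, prev1) -> counts of the next token."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, tokens):
        # Record every (a, b) -> c transition in the given token window.
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(a, b)][c] += 1

    def draft(self, context, k):
        # Greedily propose up to k tokens; stop at an unseen token pair.
        if len(context) < 2:
            return []
        window = list(context[-2:])
        proposal = []
        for _ in range(k):
            nxt = self.counts.get((window[-2], window[-1]))
            if not nxt:
                break
            tok = nxt.most_common(1)[0][0]
            proposal.append(tok)
            window.append(tok)
        return proposal


def decode(target_next, prompt, n_new, drafter, k=4):
    """Generate n_new tokens; returns (tokens, number of simulated target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        proposal = drafter.draft(tokens, k)
        passes += 1  # assumption: one batched verification pass per step
        accepted = []
        for tok in proposal:
            # Keep draft tokens only while they match the target's greedy choice.
            if target_next(tokens + accepted) != tok:
                break
            accepted.append(tok)
        # The verification pass also yields the target's own next token.
        accepted.append(target_next(tokens + accepted))
        start = len(tokens)
        tokens.extend(accepted)
        # Adapt the table using only trigrams that end in a newly verified token.
        drafter.update(tokens[max(0, start - 2):])
    return tokens[:goal], passes


def toy_target(tokens):
    # Stand-in for an LLM's greedy next-token choice (purely illustrative).
    return (tokens[-1] + tokens[-2]) % 5


if __name__ == "__main__":
    out, passes = decode(toy_target, [0, 1], 200, TrigramDrafter(), k=4)
    print(f"generated {len(out) - 2} tokens in {passes} simulated passes")
```

Because every accepted token is checked against the target's own greedy choice, the output is identical to plain token-by-token greedy decoding; any savings come only from accepting several draft tokens per verification step once the table has seen the relevant token pairs.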

Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models

Entropy-Based Decoding for Retrieval-Augmented Large Language Models

SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators

Improving Factuality by Contrastive Decoding with Factual and Hallucination Prompts

Lower Layer Matters: Alleviating Hallucination via Multi-Layer Fusion Contrastive Decoding with Truthfulness Refocused

Graph-Structured Speculative Decoding

Is Factuality Decoding a Free Lunch for LLMs? Evaluation on Knowledge Editing Benchmark

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Collaborative decoding of critical tokens for boosting factuality of large language models

Truth or Deceit? A Bayesian Decoding Game Enhances Consistency and Reliability

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Alleviating Hallucinations of Large Language Models through Induced Hallucinations

Language Models Hallucinate, but May Excel at Fact Verification

A Thorough Examination of Decoding Methods in the Era of LLMs

REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy

On Large Language Models' Hallucination with Regard to Known Facts

Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding