Abstract: Sampling-based decoding strategies have been widely adopted for Large Language Models (LLMs) in numerous applications, targeting a balance between diversity and quality via temperature tuning and tail truncation (e.g., top-k and top-p sampling). Considering the high dynamic range of the candidate next tokens given different prefixes, recent studies propose to adaptively truncate the tail of the LLM's predicted distribution. Although improved results have been reported with these methods on open-ended text generation tasks, the results are highly dependent on the curated truncation parameters and exemplar text. In this paper, we propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step, based on our collected prefix tree, which preserves the context of a full sentence. Our work provides a comprehensive comparison of existing truncation sampling methods, as well as their recommended parameters, as a guideline for users.
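
For readers unfamiliar with the tail truncation mentioned in the abstract, here is a minimal sketch of temperature scaling followed by top-k and top-p truncation. It is written in plain Python for illustration only; the function names (`softmax`, `top_k_allowed`, `top_p_allowed`, `renormalize`) are hypothetical and are not taken from the paper or from any particular library.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_allowed(probs, k):
    """Indices of the k most probable tokens (fixed-size allowed set)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def top_p_allowed(probs, p):
    """Smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    allowed, mass = set(), 0.0
    for i in order:
        allowed.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return allowed


def renormalize(probs, allowed):
    """Zero out the truncated tail and rescale the kept mass to sum to 1."""
    total = sum(probs[i] for i in allowed)
    return [probs[i] / total if i in allowed else 0.0
            for i in range(len(probs))]


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]
    print(top_k_allowed(probs, 2))
    print(top_p_allowed(probs, 0.9))
    print(renormalize(probs, top_k_allowed(probs, 2)))
```

Top-k keeps the same number of tokens for every prefix, while top-p lets the allowed set grow or shrink with the shape of the distribution; the adaptive methods discussed in the abstract change the truncation rule further.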

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address two major issues in sampling decoding strategies in large language models (LLMs):

1. **Dependency on Parameter Tuning**: The effectiveness of existing sampling methods (such as Top-k and Top-p sampling) is highly dependent on the choice of parameters. These parameters are often tuned on a very sparse grid, which is not only computationally expensive but also may result in unstable tuning outcomes due to the nonlinear relationship between performance and parameters.
2. **User Ignorance of Optimal Parameters**: In practical applications, users often select parameters through a few trials based on their needs to balance the diversity and quality of generated text. However, there are no universal optimal hyperparameters for different scenarios, and users usually do not know the best parameters for their tasks.

### Background and Motivation

- **Limitations of Existing Methods**: Traditional probability maximization methods (such as beam search) tend to produce repetitive and incoherent text, especially in open-ended tasks. Therefore, sampling decoding strategies (such as Top-p and Top-k sampling) are widely adopted, but these methods require adjusting temperature and truncation positions to balance diversity and quality, which involves a lot of trial and error.
- **Adaptive Tail Truncation Mechanism**: Recent studies have proposed some adaptive tail truncation mechanisms based on different criteria or assumptions, which can dynamically adjust the size of the allowed token set according to the given prefix. Although these methods have shown improvements in open-ended text generation tasks, their effectiveness still highly depends on parameter selection and example texts.

### Research Objectives

- **Establish Intrinsic Evaluation Benchmark**: The authors propose a systematic approach to estimate the intrinsic adaptability of sampling methods in different contexts by constructing a Context-Preserving Trie (CP-Trie). This method can evaluate the theoretical capacity of sampling methods independently of parameter tuning.
- **Identify Sweet Spots of Existing Sampling Methods**: Through systematic evaluation, the authors aim to identify the optimal parameter ranges for different sampling methods at different risk levels, providing guidance for parameter selection in practical applications.

### Main Contributions

1. **Establish Intrinsic Evaluation Benchmark**: Based on CP-Trie data, the authors propose diversity and stability metrics to evaluate the theoretical capacity of different sampling decoding methods.
2. **Comprehensive Comparison of Existing Sampling Methods**: Using the proposed evaluation benchmark, the authors conduct a comprehensive comparison of existing sampling methods, providing guidelines for method selection and parameter tuning in practical applications.

### Method Overview

- **Define the Problem**: Using CP-Trie data, the authors calculate the optimal allowed set under a given prefix and define the recall and risk of sampling methods.
- **Probability-Independent Metrics**: By checking whether the predicted next token is within the data-supported range, the authors define recall and risk, avoiding the unreliability of probabilities.
- **Parameter-Independent Evaluation**: By evaluating the average recall and the standard deviation of risk at a given risk level, the authors assess diversity and stability, eliminating the significant impact of parameter tuning.

### Experimental Results

- **Comparison at Different Risk Levels**: The authors conduct a comprehensive evaluation of various sampling methods at different risk levels. The results show that adaptive sampling and Mirostat perform best in terms of diversity and stability, while Top-p sampling performs poorly.
- **Impact of Model Size**: Within the same model family, larger models have higher average recall at the same risk level, consistent with the fact that larger models better capture the distribution of human text.

### Conclusion

By establishing an intrinsic evaluation benchmark, this paper systematically evaluates the adaptability and theoretical capacity of different sampling methods, providing scientific guidance for method selection and parameter tuning in practical applications.
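
The trie-based recall and risk idea from the method overview can be illustrated with a small sketch. The summary does not give the paper's exact formulas, so everything here is an assumption made for illustration: the trie is a plain nested dictionary, recall is taken as the fraction of data-supported next tokens that the sampler allows, and risk as the fraction of allowed tokens that the data does not support. All function names are hypothetical, and the paper's actual CP-Trie construction and metric definitions may differ.

```python
from statistics import mean, pstdev


def build_trie(sentences):
    """Build a nested-dict prefix tree from tokenized sentences."""
    root = {}
    for tokens in sentences:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
    return root


def supported_next_tokens(trie, prefix):
    """Tokens that follow `prefix` somewhere in the collected data."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()  # prefix never observed
    return set(node)


def recall_and_risk(allowed, supported):
    """ASSUMED definitions, not necessarily the paper's:
    recall = |allowed & supported| / |supported|
    risk   = |allowed - supported| / |allowed|
    """
    if not allowed or not supported:
        raise ValueError("both sets must be non-empty")
    recall = len(allowed & supported) / len(supported)
    risk = len(allowed - supported) / len(allowed)
    return recall, risk


def summarize(pairs):
    """Average recall (diversity) and std. dev. of risk (stability)."""
    recalls = [r for r, _ in pairs]
    risks = [k for _, k in pairs]
    return mean(recalls), pstdev(risks)


if __name__ == "__main__":
    data = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
    trie = build_trie(data)
    supported = supported_next_tokens(trie, ["the"])
    # Suppose a truncation method allows {"cat", "bird"} after "the".
    print(recall_and_risk({"cat", "bird"}, supported))
```

Because both quantities only check set membership, they do not depend on the model's probability values, which matches the probability-independent framing described above.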

Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation

Closing the Curious Case of Neural Text Degeneration

Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs

Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement

Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation

Follow the Wisdom of the Crowd: Effective Text Generation via Minimum Bayes Risk Decoding

Truncation Sampling as Language Model Desmoothing

Penalizing the High-likelihood: A Novel Sampling Method for Open-ended Neural Text Generation via Inverse Probability Weighting

Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation

Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models

A Systematic Characterization of Sampling Algorithms for Open-ended Language Generation

Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding

Improving Open-Ended Text Generation via Adaptive Decoding

Improve the Diversity and Novelty for Open-Ended Neural Text Generation via Inverse Probability Weighting.

Unused information in token probability distribution of generative LLM: improving LLM reading comprehension through calculation of expected values

REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy

Cascade Reward Sampling for Efficient Decoding-Time Alignment

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models

On the Efficacy of Sampling Adapters

EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling