Abstract: Sampling-based decoding strategies have been widely adopted for Large Language Models (LLMs) in numerous applications, which target a balance between diversity and quality via temperature tuning and tail truncation (e.g., top-k and top-p sampling). Considering the high dynamic range of the candidate next-token given different prefixes, recent studies propose to adaptively truncate the tail of LLM's predicted distribution. Although improved results have been reported with these methods on open-ended text generation tasks, the results are highly dependent on the curated truncation parameters and exemplar text. In this paper, we propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step, based on our collected prefix tree which preserves the context of a full sentence. Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper aims to address two major issues with sampling-based decoding strategies for large language models (LLMs):
1. **Dependency on Parameter Tuning**: The effectiveness of existing sampling methods (such as Top-k and Top-p sampling) is highly dependent on the choice of parameters. These parameters are often tuned on a very sparse grid, which is computationally expensive and can yield unstable tuning outcomes because performance depends nonlinearly on the parameters.
2. **User Ignorance of Optimal Parameters**: In practical applications, users often select parameters through a few trials based on their needs to balance the diversity and quality of generated text. However, there are no universal optimal hyperparameters for different scenarios, and users usually do not know the best parameters for their tasks.
### Background and Motivation
- **Limitations of Existing Methods**: Traditional probability maximization methods (such as beam search) tend to produce repetitive and incoherent text, especially in open-ended tasks. Therefore, sampling decoding strategies (such as Top-p and Top-k sampling) are widely adopted, but these methods require adjusting temperature and truncation positions to balance diversity and quality, which involves a lot of trial and error.
- **Adaptive Tail Truncation Mechanism**: Recent studies have proposed adaptive tail truncation mechanisms, based on different criteria or assumptions, that dynamically adjust the size of the allowed token set according to the given prefix. Although these methods have shown improvements in open-ended text generation tasks, their effectiveness still depends heavily on parameter selection and on the exemplar text.
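To make the truncation idea above concrete, here is a minimal sketch of fixed (non-adaptive) Top-k and Top-p truncation over a toy next-token distribution. The function names and the example probabilities are illustrative assumptions and are not taken from the paper; adaptive methods would replace the fixed `k` or `p` with a rule that depends on the distribution at each step.

```python
def top_k_truncate(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_truncate(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


# Toy distribution over a 5-token vocabulary (made-up numbers).
probs = [0.5, 0.3, 0.1, 0.06, 0.04]
top_k = top_k_truncate(probs, k=2)      # allowed set: tokens 0 and 1
top_p = top_p_truncate(probs, p=0.85)   # allowed set: tokens 0, 1 and 2
```

In both cases the allowed set is determined by a single user-chosen parameter, which is why results are sensitive to how that parameter is tuned.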
### Research Objectives
- **Establish Intrinsic Evaluation Benchmark**: The authors propose a systematic approach to estimate the intrinsic adaptability of sampling methods in different contexts by constructing a Context-Preserving Trie (CP-Trie). This method can evaluate the theoretical capacity of sampling methods independently of parameter tuning.
- **Identify Sweet Spots of Existing Sampling Methods**: Through systematic evaluation, the authors aim to identify the optimal parameter ranges for different sampling methods at different risk levels, providing guidance for parameter selection in practical applications.
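The summary does not describe how the CP-Trie is built, so the following is only a rough sketch of the underlying idea: a prefix structure that records, for every prefix seen in a corpus, the set of next tokens the data supports. The class name `PrefixTrie`, its methods, and the toy sentences are hypothetical, and the trie is stored as a flat prefix-to-set map for brevity.

```python
from collections import defaultdict


class PrefixTrie:
    """Maps each token prefix to the set of next tokens observed after it."""

    def __init__(self):
        # Key: tuple of prefix tokens. Value: set of observed next tokens.
        self.next_tokens = defaultdict(set)

    def add(self, tokens):
        """Register every (prefix, next token) pair of one full sentence."""
        for i in range(len(tokens)):
            self.next_tokens[tuple(tokens[:i])].add(tokens[i])

    def supported(self, prefix):
        """Return the data-supported next tokens for a prefix (may be empty)."""
        return self.next_tokens.get(tuple(prefix), set())


trie = PrefixTrie()
trie.add(["the", "cat", "sat"])
trie.add(["the", "cat", "ran"])
trie.add(["the", "dog", "ran"])

after_the_cat = trie.supported(["the", "cat"])  # {"sat", "ran"}
```

Because each sentence is inserted whole, every prefix keeps the context of the full sentence it came from, which is the property the abstract attributes to the collected prefix tree.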
### Main Contributions
1. **Establish Intrinsic Evaluation Benchmark**: Based on CP-Trie data, the authors propose diversity and stability metrics to evaluate the theoretical capacity of different sampling decoding methods.
2. **Comprehensive Comparison of Existing Sampling Methods**: Using the proposed evaluation benchmark, the authors conduct a comprehensive comparison of existing sampling methods, providing guidelines for method selection and parameter tuning in practical applications.
### Method Overview
- **Define the Problem**: Using CP-Trie data, the authors calculate the optimal allowed set under a given prefix and define the recall and risk of sampling methods.
- **Probability-Independent Metrics**: By checking whether the predicted next token is within the data-supported range, the authors define recall and risk, avoiding the unreliability of probabilities.
- **Parameter-Independent Evaluation**: By evaluating the average recall and the standard deviation of risk at a given risk level, the authors assess diversity and stability, eliminating the significant impact of parameter tuning.
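The metrics above can be sketched as follows. The exact definitions are not given in this summary, so the formulas here are one plausible formalization and should be read as assumptions: recall is taken as the fraction of data-supported tokens that the sampling method allows, and risk as the fraction of allowed tokens that fall outside the data-supported set. The sketch also omits the calibration step in which each method's parameter is adjusted to reach a given risk level.

```python
from statistics import mean, pstdev


def recall_and_risk(allowed, supported):
    """Compare a method's allowed token set with the data-supported set.

    Assumed definitions (not confirmed by the source):
      recall = |allowed & supported| / |supported|
      risk   = |allowed - supported| / |allowed|
    """
    allowed, supported = set(allowed), set(supported)
    recall = len(allowed & supported) / len(supported)
    risk = len(allowed - supported) / len(allowed) if allowed else 0.0
    return recall, risk


def summarize(pairs):
    """Average recall (diversity) and std of risk (stability) over prefixes."""
    recalls, risks = zip(*(recall_and_risk(a, s) for a, s in pairs))
    return mean(recalls), pstdev(risks)


# Each pair: (tokens allowed by the sampler, tokens supported by the data).
pairs = [
    ({"sat", "ran", "flew"}, {"sat", "ran"}),  # recall 1.0, risk 1/3
    ({"cat"}, {"cat", "dog"}),                 # recall 0.5, risk 0.0
]
avg_recall, risk_std = summarize(pairs)
```

Both quantities depend only on set membership, not on the probabilities the model assigns, which matches the probability-independent framing described above.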
### Experimental Results
- **Comparison at Different Risk Levels**: The authors conduct a comprehensive evaluation of various sampling methods at different risk levels. The results show that adaptive sampling and Mirostat perform best in terms of diversity and stability, while Top-p sampling performs poorly.
- **Impact of Model Size**: Within the same model family, larger models have higher average recall at the same risk level, consistent with the fact that larger models better capture the distribution of human text.
### Conclusion
By establishing an intrinsic evaluation benchmark, this paper systematically evaluates the adaptability and theoretical capacity of different sampling methods and provides guidance for method selection and parameter tuning in practical applications.