Abstract: Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. However, popular sampling methods like top-p (nucleus sampling) often struggle to balance quality and diversity, especially at higher temperatures, leading to incoherent or repetitive outputs. To address this challenge, we propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by scaling according to the top token's probability. We conduct extensive experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing, demonstrating that min-p sampling improves both the quality and diversity of generated text, particularly at high temperatures. Moreover, human evaluations reveal a clear preference for min-p sampling in terms of both text quality and diversity. Min-p sampling has been adopted by multiple open-source LLM implementations, highlighting its practical utility and potential impact.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses the challenge of balancing creativity and coherence when large language models (LLMs) generate text. Specifically, existing sampling methods such as top-p (nucleus sampling) often struggle to maintain both quality and diversity at higher temperatures, producing text that may be incoherent or repetitive. To solve this problem, the authors propose a new sampling method, min-p sampling, which balances creativity and coherence by dynamically adjusting the sampling threshold, especially in high-temperature settings.

### Main contributions

1. **Introduction of min-p sampling**: A new dynamic truncation method that adjusts the sampling threshold according to the model's confidence, thereby maintaining the quality and diversity of the generated text even at high temperatures.
2. **Experimental verification**: Extensive experiments on multiple benchmark datasets show that min-p sampling outperforms existing sampling methods such as top-p sampling and top-k sampling in terms of the quality and diversity of the generated text.
3. **Human evaluation**: A comprehensive human evaluation was carried out, and the results show that participants prefer the output of min-p sampling and rate it higher in both quality and diversity.
4. **Practical guide**: The paper provides experience-based guidelines for using min-p sampling, helping practitioners select appropriate hyperparameters and follow best practices.
5. **Community adoption**: Min-p sampling has been adopted by multiple open-source LLM implementations, further demonstrating its effectiveness and practicality.

### Method overview

The core idea of min-p sampling is to dynamically adjust the sampling threshold according to the model's confidence at each decoding step. The specific steps are as follows:

1. **Calculate the maximum probability**: Determine the probability of the most likely token in the distribution, \( p_{\text{max}} = \max_{v \in V} P(v \mid x_{1:t-1}) \).
2. **Define the truncation threshold**: Set a base probability threshold \( p_{\text{base}} \in (0, 1] \) and scale it by \( p_{\text{max}} \) to obtain the actual truncation threshold \( p_{\text{scaled}} = p_{\text{base}} \times p_{\text{max}} \).
3. **Define the sampling pool**: Construct a sampling pool \( V_{\text{min}} \) containing the tokens whose probabilities are greater than or equal to \( p_{\text{scaled}} \): \( V_{\text{min}} = \{ v \in V : P(v \mid x_{1:t-1}) \geq p_{\text{scaled}} \} \).
4. **Sample from the pool**: Sample the next token \( x_t \) from \( V_{\text{min}} \) according to the renormalized probability:

   \[ P'(v) = \frac{P(v \mid x_{1:t-1})}{\sum_{v' \in V_{\text{min}}} P(v' \mid x_{1:t-1})} \quad \text{for} \quad v \in V_{\text{min}} \]

### Experimental results

The experimental results show that min-p sampling outperforms existing sampling methods on multiple benchmark datasets (such as GPQA, GSM8K, and AlpacaEval Creative Writing). Especially in high-temperature settings, min-p sampling better maintains the coherence and diversity of the generated text. Human evaluation confirms this: participants generally rate the output of min-p sampling higher in both quality and diversity.

### Conclusion

By introducing min-p sampling, this paper provides an effective way to balance creativity and coherence at high temperatures, thereby generating high-quality and diverse text. The method not only performs well in experiments but has also been adopted by multiple open-source LLM implementations, giving it both theoretical and practical significance.
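
The four steps in the method overview can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `min_p_filter` and `min_p_sample`, the dictionary representation of the distribution, and the toy probabilities are all hypothetical, and temperature scaling is assumed to have been applied before the probabilities are passed in.

```python
import random
from typing import Dict, Optional


def min_p_filter(probs: Dict[str, float], p_base: float) -> Dict[str, float]:
    """Truncate a next-token distribution with min-p and renormalize it.

    `probs` maps each token to P(v | x_{1:t-1}); `p_base` lies in (0, 1].
    Both names are illustrative, not taken from any library.
    """
    if not 0.0 < p_base <= 1.0:
        raise ValueError("p_base must lie in (0, 1]")
    p_max = max(probs.values())        # step 1: top-token probability
    p_scaled = p_base * p_max          # step 2: scaled truncation threshold
    # Step 3: keep tokens whose probability is at least p_scaled.
    pool = {tok: p for tok, p in probs.items() if p >= p_scaled}
    total = sum(pool.values())
    # Step 4 (first half): renormalize over the surviving pool.
    return {tok: p / total for tok, p in pool.items()}


def min_p_sample(probs: Dict[str, float], p_base: float,
                 rng: Optional[random.Random] = None) -> str:
    """Draw one token from the min-p truncated distribution."""
    rng = rng or random.Random()
    filtered = min_p_filter(probs, p_base)
    tokens = list(filtered)
    weights = list(filtered.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distributions with made-up numbers, for illustration only.
confident = {"the": 0.80, "a": 0.10, "an": 0.06, "zebra": 0.04}
uncertain = {"red": 0.30, "blue": 0.25, "green": 0.25, "odd": 0.20}

# Confident step: threshold is 0.1 * 0.80 = 0.08, so only "the" and "a" survive.
print(sorted(min_p_filter(confident, p_base=0.1)))
# Uncertain step: threshold is 0.1 * 0.30 = 0.03, so all four tokens survive.
print(sorted(min_p_filter(uncertain, p_base=0.1)))
print(min_p_sample(uncertain, p_base=0.1, rng=random.Random(0)))
```

The two toy cases show why the threshold is described as dynamic: the same `p_base` gives a tight pool when the model is confident and a wide one when probability mass is spread out.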

Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs

EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling

Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation

Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models

Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

Top-$nσ$: Not All Logits Are You Need

Priority Sampling of Large Language Models for Compilers

The Effect of Sampling Temperature on Problem Solving in Large Language Models

Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling

Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs

A Probability--Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors

Gamma Sampling: Fine-grained Controlling Language Models without Training

Truncation Sampling as Language Model Desmoothing

On the Efficacy of Sampling Adapters

Flaming-hot Initiation with Regular Execution Sampling for Large Language Models

A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors

Quasi-random Multi-Sample Inference for Large Language Models

Minions: Accelerating Large Language Model Inference with Aggregated Speculative Execution

REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

KL-Divergence Guided Temperature Sampling