Abstract: Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-$n\sigma$, a novel sampling method that operates directly on pre-softmax logits by leveraging a statistical threshold. Our key insight is that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region, enabling efficient token filtering without complex probability manipulations. Unlike existing methods (e.g., top-$p$, min-$p$) that inadvertently include more noise tokens at higher temperatures, top-$n\sigma$ maintains a stable sampling space regardless of temperature scaling. We also provide a theoretical analysis of top-$n\sigma$ to better understand its behavior. The extensive experimental results across four reasoning-focused datasets demonstrate that our method not only outperforms existing sampling approaches but also surpasses greedy decoding, while maintaining consistent performance even at high temperatures.
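The abstract's claim that top-$p$ admits more tokens as temperature rises can be checked on toy data. The sketch below is illustrative only: the logits are synthetic (1000 Gaussian "noise" values plus three hand-picked outliers), and the helper names `softmax` and `top_p_size` are hypothetical, not taken from the paper.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_size(logits, p=0.9, temperature=1.0):
    """Number of tokens in the smallest set whose cumulative probability reaches p."""
    probs = sorted(softmax(logits, temperature), reverse=True)
    cumulative = 0.0
    for count, prob in enumerate(probs, start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)


rng = random.Random(0)
# Assumed toy logits: a Gaussian noise region plus three informative outliers.
logits = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [10.0, 9.5, 9.0]

# Nucleus size at several temperatures; it can only grow as temperature rises.
sizes = {t: top_p_size(logits, p=0.9, temperature=t) for t in (0.5, 1.0, 2.0, 4.0)}
for t, size in sizes.items():
    print(f"T={t}: top-p keeps {size} tokens")
```

On data shaped like this, the nucleus holds only the outliers at low temperature and spills into the noise region once the distribution flattens, which is the behavior the abstract attributes to probability-based truncation.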

What problem does this paper attempt to address?

The problem this paper addresses is that existing sampling methods for large language models (LLMs) perform poorly on tasks that require precise reasoning, especially at high temperatures. The conventional view holds that deterministic decoding (such as greedy decoding) usually outperforms stochastic sampling on such tasks, because sampling favors diversity over accuracy. This view implies a trade-off between diversity and reasoning accuracy.

To address this, the paper proposes a new sampling method, top-$n\sigma$. The method operates directly on the pre-softmax logits and uses a statistical threshold to separate a noise region from an informative region, which gives efficient and stable token filtering. Specifically, it relies on identifying two parts in the logits: a noise region that follows a Gaussian distribution and an informative region made up of significant outliers. As a result, top-$n\sigma$ keeps a stable sampling space across temperatures, instead of admitting more noisy tokens at high temperatures as existing methods (such as top-$p$ and min-$p$) do.

### Main contributions of the paper:

1. **Novel logit perspective**: The paper introduces a new analytical framework centered on the distribution of pre-softmax logits, offering insights for developing sampling strategies and improving model training.
2. **Efficient top-$n\sigma$ algorithm**: A conceptually simple but powerful sampling method that operates directly on logits. It improves generation quality while remaining computationally efficient, requiring neither sorting nor an additional softmax transformation.
3. **Applicable to test-time scaling techniques**: Top-$n\sigma$ can explore the solution space more carefully and strikes a better balance between exploration and exploitation, which makes it especially suitable for test-time scaling techniques.
4. **Theoretical analysis**: A quantitative analysis of top-$n\sigma$ is provided, including proofs of its cumulative probability mass properties and its temperature invariance, giving a theoretical foundation for implementing and understanding the method.
5. **Extensive empirical verification**: Experiments on four datasets show a significant improvement in generation quality, especially at high temperatures.

### Main findings:

- **Logit distribution characteristics**: The paper observes that the logit distribution usually consists of two parts: a noise region that follows a Gaussian distribution and an informative region made up of significant outliers.
- **Temperature invariance**: The set of candidate tokens selected by top-$n\sigma$ is the same at every temperature, in contrast to existing methods (such as top-$p$ and min-$p$), which admit more noisy tokens at high temperatures.
- **Exploration control**: By decoupling the parameter $n$ from the temperature, top-$n\sigma$ gives finer control over the sampling process. The parameter $n$ sets the boundary between valid tokens and noisy tokens, while the temperature governs how exploration proceeds within the valid token space.

In conclusion, top-$n\sigma$ addresses the shortcomings of existing sampling methods on tasks that require precise reasoning, especially their performance degradation at high temperatures, and offers a new approach to text generation with large language models.
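The filtering step and its temperature invariance can be sketched in a few lines. This summary does not spell out the threshold, so the sketch assumes a token is kept when its logit lies within $n$ standard deviations of the maximum logit; dividing every logit by a temperature $T$ scales both the maximum and the standard deviation by $1/T$, so the kept set does not change. The function names and toy logits are hypothetical, and this is an illustration under those assumptions rather than the authors' implementation.

```python
import math
import random
import statistics


def top_n_sigma_mask(logits, n=1.0, temperature=1.0):
    """Boolean mask of kept tokens (assumed rule: logit >= max - n * std)."""
    scaled = [x / temperature for x in logits]
    threshold = max(scaled) - n * statistics.pstdev(scaled)
    return [x >= threshold for x in scaled]


def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=random):
    """Sample one token index from the kept set, weighted by scaled logits."""
    scaled = [x / temperature for x in logits]
    mask = top_n_sigma_mask(logits, n, temperature)
    m = max(scaled)
    # Unnormalized softmax weights; masked-out tokens get zero weight.
    weights = [math.exp(x - m) if keep else 0.0 for x, keep in zip(scaled, mask)]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
# Assumed toy logits: a Gaussian noise region plus three informative outliers.
logits = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [10.0, 9.5, 9.2]

# The kept set should be identical at every temperature.
kept_by_temperature = {
    t: {i for i, keep in enumerate(top_n_sigma_mask(logits, n=1.0, temperature=t)) if keep}
    for t in (0.5, 1.0, 2.0, 4.0)
}
sample = top_n_sigma_sample(logits, n=1.0, temperature=4.0, rng=rng)
print(kept_by_temperature[1.0], sample)
```

In this sketch the mask needs only a maximum and a standard deviation, both computable in a single pass without sorting, which is consistent with the efficiency claim above. Temperature then affects only how probability is spread among the kept tokens.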

Top-$nσ$: Not All Logits Are You Need
