Top-$n\sigma$: Not All Logits Are You Need

Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang
2024-11-12
Abstract: Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-$n\sigma$, a novel sampling method that operates directly on pre-softmax logits by leveraging a statistical threshold. Our key insight is that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region, enabling efficient token filtering without complex probability manipulations. Unlike existing methods (e.g., top-$p$, min-$p$) that inadvertently include more noise tokens at higher temperatures, top-$n\sigma$ maintains a stable sampling space regardless of temperature scaling. We also provide a theoretical analysis of top-$n\sigma$ to better understand its behavior. The extensive experimental results across four reasoning-focused datasets demonstrate that our method not only outperforms existing sampling approaches but also surpasses greedy decoding, while maintaining consistent performance even at high temperatures.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the poor performance of existing sampling methods for large language models (LLMs) on tasks that require precise reasoning, especially at high temperatures. The conventional view is that deterministic decoding (such as greedy decoding) outperforms stochastic sampling on such tasks, because sampling favors diversity over accuracy. This view implies a trade-off between diversity and reasoning accuracy.

To remove that trade-off, the paper proposes a new sampling method, top-nσ. The method operates directly on the pre-softmax logits and uses a statistical threshold to separate a noise region from an informative region, which gives efficient and stable token filtering. Specifically, top-nσ relies on identifying two parts of the logit distribution: a noise region that follows a Gaussian distribution and an informative region made up of significant outliers. As a result, top-nσ keeps a stable sampling space across temperatures, whereas existing methods (such as top-p and min-p) admit more noise tokens as the temperature rises.

### Main contributions of the paper:

1. **Novel logit perspective**: The paper introduces an analytical framework centered on the distribution of pre-softmax logits, offering insights for the design of sampling strategies and for model training.
2. **Efficient top-nσ algorithm**: A conceptually simple but powerful sampling method is proposed. It operates directly on logits, improves generation quality, and stays computationally efficient because it requires neither sorting nor an additional softmax transformation.
3. **Applicable to test-time scaling techniques**: Top-nσ explores the solution space more carefully and achieves a better balance between exploration and exploitation, which makes it especially suitable for test-time scaling techniques.
4. **Theoretical analysis**: A quantitative analysis of top-nσ is provided, including proofs of its cumulative probability mass properties and its temperature invariance, giving the method a theoretical foundation.
5. **Extensive empirical verification**: Experiments on four datasets demonstrate a significant improvement in generation quality, especially at high temperatures.

### Main findings:

- **Logit distribution characteristics**: The paper observes that the logit distribution usually consists of two parts: a noise region that follows a Gaussian distribution and an informative region containing significant outliers.
- **Temperature invariance**: The set of candidate tokens selected by top-nσ is the same at every temperature, in contrast to existing methods (such as top-p and min-p), which admit more noise tokens at high temperatures.
- **Exploration control**: By decoupling the parameter n from the temperature, top-nσ provides finer control over the sampling process. The parameter n sets the boundary between valid tokens and noise tokens, while the temperature regulates the exploration strategy within the valid token space.

In conclusion, by proposing top-nσ, the paper addresses the shortcomings of existing sampling methods on tasks requiring precise reasoning, in particular their performance degradation at high temperatures, and offers a new approach to text generation with large language models.
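
The filtering and temperature-invariance claims above can be illustrated with a minimal Python sketch. The summary does not spell out the exact threshold, so the rule used here (keep tokens whose temperature-scaled logit is within n standard deviations of the maximum logit) is an assumption suggested by the method's name, as is the use of the population standard deviation. The helper names (`top_n_sigma_mask`, `top_n_sigma_sample`, `min_p_mask`), the synthetic logits, and the parameter values are all hypothetical and chosen only for illustration. The min-p helper uses the usual definition (keep tokens whose probability is at least `p_base` times the top probability) as a point of comparison.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array (-inf entries get probability 0)."""
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def top_n_sigma_mask(logits, n=1.0, temperature=1.0):
    """Assumed rule: keep token i if scaled_logit[i] >= max - n * std.

    Uses the population standard deviation (NumPy's default, ddof=0).
    """
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    threshold = scaled.max() - n * scaled.std()
    return scaled >= threshold

def top_n_sigma_sample(logits, rng, n=1.0, temperature=1.0):
    """Mask filtered-out tokens with -inf, then sample from the softmax of the rest."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    keep = top_n_sigma_mask(logits, n, temperature)
    probs = softmax(np.where(keep, scaled, -np.inf))
    return int(rng.choice(len(probs), p=probs))

def min_p_mask(logits, p_base=0.1, temperature=1.0):
    """min-p for comparison: keep tokens with probability >= p_base * max probability."""
    probs = softmax(np.asarray(logits, dtype=np.float64) / temperature)
    return probs >= p_base * probs.max()

# Synthetic logits: three "informative" outliers on top of Gaussian noise.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=1000)
logits = np.concatenate([[10.0, 9.6, 9.3], noise])

# Compare how many tokens each filter keeps as the temperature rises.
for t in (0.5, 1.0, 2.0, 5.0):
    kept_sigma = int(top_n_sigma_mask(logits, n=1.0, temperature=t).sum())
    kept_min_p = int(min_p_mask(logits, p_base=0.1, temperature=t).sum())
    print(f"T={t}: top-n-sigma keeps {kept_sigma}, min-p keeps {kept_min_p}")
```

Under the assumed rule, dividing the logits by the temperature rescales the maximum and the standard deviation by the same factor, so the comparison against the threshold is unchanged and the kept set does not depend on temperature. On this synthetic vector the top-nσ count stays constant across the loop, while the min-p count grows sharply at the highest temperature, because its cutoff in logit space widens in proportion to the temperature.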