Esteban Garces Arias, Julian Rodemann, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
Abstract: Decoding from the output distributions of large language models to produce high-quality text is a complex challenge in language modeling. Various approaches, such as beam search, sampling with temperature, $k$-sampling, nucleus $p$-sampling, typical decoding, contrastive decoding, and contrastive search, have been proposed to address this problem, aiming to improve coherence, diversity, as well as resemblance to human-generated text. In this study, we introduce adaptive contrastive search, a novel decoding strategy extending contrastive search by incorporating an adaptive degeneration penalty, guided by the estimated uncertainty of the model at each generation step. This strategy is designed to enhance both the creativity and diversity of the language modeling process while at the same time producing coherent and high-quality generated text output. Our findings indicate performance enhancement in both aspects, across different model architectures and datasets, underscoring the effectiveness of our method in text generation tasks. Our code base, datasets, and models are publicly available.
### Problems the Paper Aims to Solve
This paper addresses the challenge of generating high-quality text in natural language generation (NLG) tasks. Although large language models perform well on many tasks, generating coherent, diverse, and human-like text remains difficult. Existing decoding strategies, such as beam search, sampling with temperature, top-k sampling, nucleus sampling, typical decoding, contrastive decoding, and contrastive search, have improved text coherence and diversity to some extent but still have limitations.
Specifically, these methods often lead to generated text that is repetitive or lacks creativity. To overcome these issues, the authors propose Adaptive Contrastive Search (ACS), a new decoding strategy that enhances creativity and diversity while maintaining coherence by automatically adjusting the degeneration penalty based on the model's uncertainty at each generation step.
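For context, the selection rule of standard contrastive search, as it is usually presented following Su et al. (2022), can be written as below. The notation is not taken from this summary and is an assumption for illustration: \( V^{(k)} \) is the set of the \( k \) most probable candidate tokens, \( p_\theta \) is the model probability, \( h_v \) is the hidden representation of token \( v \), and \( s \) is cosine similarity. The second term is the degeneration penalty, weighted by \( \alpha \).

\[
x_t = \underset{v \in V^{(k)}}{\arg\max} \Big\{ (1-\alpha)\, p_\theta(v \mid x_{<t}) - \alpha \max_{1 \le j \le t-1} s(h_v, h_{x_j}) \Big\}
\]

In this reading, ACS keeps the same form but replaces the fixed \( k \) and \( \alpha \) with per-step values \( k_t \) and \( \alpha_t \) derived from the model's uncertainty.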
### Main Contributions
1. **Proposing Adaptive Contrastive Search (ACS)**: Building on the work of Su et al. (2022), ACS automatically adjusts the number of candidate tokens and the degeneration penalty by measuring the model's uncertainty at each time step.
2. **Comprehensive Experimental Comparison**: Comparing ACS with various existing decoding methods (such as nucleus sampling, contrastive decoding, and contrastive search) to validate its performance in open-ended text generation tasks.
3. **New Insights into the MAUVE Metric**: Exploring the correlation between the MAUVE metric and human judgment, emphasizing the need for a more robust evaluation metric to better reflect human preferences.
4. **Open Source Code and Datasets**: Providing code, datasets, and models to facilitate further research.
### Methodology
The core of ACS lies in dynamically adjusting decoding parameters based on the model's uncertainty. The specific steps include:
1. **Measuring Uncertainty**: Calculating the entropy \( H(X)^{(t)} \) of the output distribution at step \( t \).
2. **Centering**: Subtracting the median entropy of the previous generation steps.
3. **Scaling**: Dividing by the maximum entropy to obtain a relative measure.
4. **Calculating**: Passing the centered and scaled entropy values through a sigmoid function to obtain the values of \( \alpha_t \) and \( k_t \).
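The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the range `k_min`..`k_max`, the use of the same weight for both \( \alpha_t \) and \( k_t \), and the placement of the temperature `q` as a multiplier inside the sigmoid are all hypothetical choices made here to keep the example short.

```python
import math
import statistics


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_weight(probs, past_entropies, q=1.0):
    """Map the current entropy to a value in (0, 1).

    Steps: measure entropy, center on the median of earlier entropies,
    scale by the maximum possible entropy, then apply a sigmoid.
    The multiplier q is an assumed form of the temperature parameter.
    """
    h_t = entropy(probs)
    h_max = math.log(len(probs))  # entropy of the uniform distribution
    if h_max == 0.0:
        return 0.5  # single-token vocabulary: no uncertainty to measure
    # With no history yet, center on the current value (neutral weight).
    center = statistics.median(past_entropies) if past_entropies else h_t
    delta = q * (h_t - center) / h_max
    return 1.0 / (1.0 + math.exp(-delta))


def adaptive_parameters(probs, past_entropies, q=1.0, k_min=5, k_max=15):
    """Return (alpha_t, k_t); the k range is a hypothetical choice."""
    w = adaptive_weight(probs, past_entropies, q)
    alpha_t = w
    k_t = k_min + round((k_max - k_min) * w)
    return alpha_t, k_t


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]
    history = [1.0, 1.2, 1.1]
    print(adaptive_parameters(peaked, history))
```

Under these assumptions, a step where the model is more certain than usual (entropy below the running median) yields a smaller penalty weight and fewer candidates, while an unusually uncertain step yields a larger weight and more candidates.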
### Experimental Setup
1. **Evaluation Metrics**: Automatically evaluating the quality of generated text using three metrics: Diversity, MAUVE, and Coherence.
2. **Datasets**: Evaluating in three domains: news, Wikipedia articles, and stories, using the Wikinews, WikiText-103, and BookCorpus datasets, respectively.
3. **Baselines**: Comparing with decoding methods such as greedy search, beam search, top-k sampling, nucleus sampling, typical decoding, contrastive decoding, and contrastive search with fixed parameters.
4. **Models**: Using three different sizes of GPT-2 models (gpt2-xl, gpt2-large, and gpt2-medium) to explore the impact of model size on ACS performance.
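The summary does not define the Diversity metric listed under Evaluation Metrics. A definition commonly used in the contrastive search literature multiplies, over n-gram orders 2 to 4, the fraction of n-grams that are unique. The sketch below assumes that definition and operates on a pre-tokenized list; the function names are hypothetical, and the paper's exact tokenization and aggregation may differ.

```python
def ngram_repetition(tokens, n):
    """Fraction of n-grams in `tokens` that are repeats (0.0 = none repeated)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence too short to contain any n-gram
    return 1.0 - len(set(ngrams)) / len(ngrams)


def diversity(tokens, orders=(2, 3, 4)):
    """Product over n of (1 - repetition rate); 1.0 means no repeated n-grams."""
    score = 1.0
    for n in orders:
        score *= 1.0 - ngram_repetition(tokens, n)
    return score


if __name__ == "__main__":
    print(diversity("the cat sat on the mat".split()))
    print(diversity("a b a b a b a b".split()))
```

A text that loops over the same phrase scores close to 0, which is the degenerate behavior that the degeneration penalty in contrastive search is meant to discourage.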
### Experimental Results
1. **Automatic Evaluation**: ACS performs well in terms of diversity, MAUVE, and coherence, and this holds across the different datasets and model sizes.
2. **Human Evaluation**: Human evaluation results show that ACS outperforms non-adaptive methods in terms of fluency and coherence.
3. **Ablation Study**: By adjusting the temperature parameter \( q \), it is found that ACS is sensitive to different \( q \) values; increasing \( q \) can improve diversity but reduce coherence.
4. **Generation Speed**: The generation speed of ACS is slightly lower than that of contrastive search with fixed parameters but remains within an acceptable range.
5. **Multilingual Evaluation**: Evaluations in eight different languages show that ACS maintains good performance in most languages.
### Conclusion
ACS enhances creativity and diversity while maintaining coherence by adaptively adjusting the decoding parameters based on the model's uncertainty at each generation step.