Investigating Data Contamination for Pre-training Language Models

Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo
2024-01-12
Abstract: Language models pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus (a phenomenon known as *data contamination*) in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models *from scratch*. We highlight the effect of both text contamination (i.e., input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on language model capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper evaluates how data contamination (i.e., leakage of evaluation data into the pre-training corpus) affects the performance of large language models (LLMs) on downstream tasks, focusing on contamination introduced at the pre-training stage. It addresses three research questions:

1. **RQ1: How are language models affected by various forms of intentional contamination in the pre-training corpus?**
   - The authors inject intentional contamination (both text contamination and ground-truth contamination) into the pre-training corpus, pre-train GPT-2 small models from scratch, and evaluate them under the different contamination conditions. They also run extended experiments with GPT-2 large to see how the results change at a larger scale.
2. **RQ2: How does the number of repetitions of evaluation data in the pre-training corpus affect model performance?**
   - The authors inject the evaluation data into the pre-training corpus multiple times. As the number of repetitions increases, performance first rises and then falls, a trend described as a U-shaped curve.
3. **RQ3: How effective is the n-gram-based contamination definition used in recent LLM reports?**
   - The authors filter out varying proportions of the training documents flagged as "contaminated" and re-pre-train the models to test these definitions. The results indicate that the existing n-gram-based definitions are insufficient and do not reliably identify the contamination that actually affects performance.

### Main Findings

- **Impact of Data Contamination**:
  - Both text contamination and ground-truth contamination significantly improve model performance on certain tasks, especially tasks that require understanding the evaluation prompts (such as the CNN and SQuAD datasets).
  - For simple text classification tasks (such as SST-2), however, ground-truth contamination is not always better than text contamination.
- **U-shaped Effect of Repeated Contamination**:
  - Repeating the contamination improves performance at first, but once the number of repetitions exceeds a certain threshold, performance begins to decline.
- **Limitations of Existing Contamination Definitions**:
  - The n-gram-based contamination definition has high false positive and false negative rates and does not effectively identify actual contamination. After the documents flagged as "contaminated" were removed, model performance did not decrease significantly, which points to the deficiencies of the existing definitions.

### Conclusion

The paper shows experimentally that data contamination, particularly at the pre-training stage, has a significant impact on language model performance. Existing contamination detection methods have clear limitations, and more stringent evaluation methods are required to ensure the robustness of the models. These findings emphasize the importance of independent and comprehensive contamination evaluation in LLM research.
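
As a rough illustration of the kind of n-gram-based contamination definition examined under RQ3, the Python sketch below flags a training document as "contaminated" when a large enough share of an evaluation sample's n-grams also appear in that document. The whitespace tokenization, the function names (`ngrams`, `overlap_ratio`, `is_contaminated`), and the default values of `n` and `threshold` are assumptions chosen for illustration; they are not the exact definitions used in the paper or in any particular LLM report.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams of `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(eval_text: str, doc_text: str, n: int = 8) -> float:
    """Fraction of the evaluation sample's n-grams that also occur in the document."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        # The sample is shorter than n tokens, so there is nothing to match.
        return 0.0
    return len(eval_grams & ngrams(doc_text, n)) / len(eval_grams)


def is_contaminated(eval_text: str, doc_text: str,
                    n: int = 8, threshold: float = 0.7) -> bool:
    """Flag the document if the overlap ratio reaches the (assumed) threshold."""
    return overlap_ratio(eval_text, doc_text, n) >= threshold


if __name__ == "__main__":
    # Hypothetical evaluation sample and training documents, for illustration only.
    eval_sample = "the movie was a delightful surprise from start to finish"
    verbatim_doc = ("Review: the movie was a delightful surprise "
                    "from start to finish . Loved it")
    paraphrase_doc = "from start to finish this film surprised and delighted me"

    for name, doc in [("verbatim", verbatim_doc), ("paraphrase", paraphrase_doc)]:
        ratio = overlap_ratio(eval_sample, doc, n=3)
        flagged = is_contaminated(eval_sample, doc, n=3)
        print(f"{name}: overlap={ratio:.2f} flagged={flagged}")
```

In this toy example the verbatim copy is flagged, while the paraphrase, which conveys the same content, shares only a few n-grams with the evaluation sample and passes the check. This is the sort of false negative that a purely surface-level overlap definition can produce.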