Abstract: Language models pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus (a phenomenon known as *data contamination*) in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models *from scratch*. We highlight the effect of both text contamination (*i.e.*, input text of the evaluation samples) and ground-truth contamination (*i.e.*, the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on language model capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.

What problem does this paper attempt to address?

The paper evaluates how data contamination (that is, evaluation data leaking into the pre-training corpus) affects the performance of large language models (LLMs) on downstream tasks, focusing on contamination introduced at the pre-training stage. It poses three core research questions:

1. **RQ1: How are language models affected by various forms of intentional contamination in the pre-training corpus?**
   - The authors deliberately inject contamination into the pre-training corpus, both text contamination and ground-truth contamination, then pre-train GPT-2 small models from scratch and evaluate them under the different contamination conditions. They also run extended experiments with GPT-2 large to see how the results change at a larger scale.
2. **RQ2: How does the number of repetitions of evaluation data in the pre-training corpus affect model performance?**
   - The authors inject the evaluation data into the pre-training corpus multiple times and measure the effect of the repetition count. As the number of repetitions increases, model performance follows a U-shaped curve: it first increases and then decreases.
3. **RQ3: How effective is the n-gram-based contamination definition used in recent LLM reports?**
   - The authors systematically filter out different proportions of training documents flagged as "contaminated" and re-pre-train the models to test how well these definitions work. The results indicate that the existing n-gram-based definitions are insufficient and cannot accurately identify contamination that actually affects performance.

### Main Findings

- **Impact of Data Contamination**:
  - Both text contamination and ground-truth contamination significantly improve model performance on certain tasks, especially tasks that require understanding the evaluation prompts (such as the CNN and SQuAD datasets).
  - For simple text classification tasks (such as SST-2), however, ground-truth contamination is not always better than text contamination.
- **U-shaped Effect of Repeated Contamination**:
  - The effect of the number of repetitions on model performance is a U-shaped curve. Initially, more repetitions improve performance, but beyond a certain threshold performance begins to decline.
- **Limitations of Existing Contamination Definitions**:
  - The n-gram-based contamination definition has high false positive and false negative rates and cannot effectively identify actual contamination. After the documents flagged as "contaminated" were removed, model performance did not decrease significantly, which points to the deficiencies of the existing definitions.

### Conclusion

The paper shows experimentally that data contamination, particularly at the pre-training stage, has a significant impact on language model performance. Existing contamination detection methods have limitations, however, and more stringent evaluation methods are needed to ensure that reported results are robust. These findings underline the importance of independent and comprehensive contamination evaluation in LLM research.
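To make the n-gram-based contamination definition discussed above more concrete, the following is a minimal Python sketch of one way such a check could work. It is not the paper's implementation or the definition from any specific LLM report: the function names, whitespace tokenization, the n-gram length `n=8`, and the `threshold=0.5` cutoff are all assumptions chosen for illustration.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in `text`.

    Assumption: lowercasing plus whitespace splitting stands in for a real tokenizer.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(document, eval_sample, n=8):
    """Fraction of the evaluation sample's n-grams that also occur in the document."""
    sample_grams = ngrams(eval_sample, n)
    if not sample_grams:
        # The sample is shorter than n tokens, so it has no n-grams to match.
        return 0.0
    return len(sample_grams & ngrams(document, n)) / len(sample_grams)


def is_contaminated(document, eval_sample, n=8, threshold=0.5):
    """Flag the document if enough of the sample's n-grams appear in it.

    `n` and `threshold` are hypothetical values, not taken from the paper.
    """
    return overlap_ratio(document, eval_sample, n) >= threshold


# Hypothetical evaluation sample and two hypothetical training documents.
eval_sample = "the movie was a delightful surprise with sharp writing and a strong cast"
verbatim_doc = "review: " + eval_sample + " . highly recommended"
paraphrased_doc = "a delightful surprise of a film , sharply written and very well acted"

print(is_contaminated(verbatim_doc, eval_sample))     # verbatim copy of the sample
print(is_contaminated(paraphrased_doc, eval_sample))  # paraphrase of the sample
```

In this sketch the verbatim copy is flagged while the paraphrase is not, even though both documents carry the same content. This illustrates one kind of false negative that exact n-gram matching can produce. The false positive case, where overlapping n-grams do not leak anything useful for the task, is not shown here.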

Investigating Data Contamination for Pre-training Language Models
