An Open Source Data Contamination Report for Large Language Models

Yucheng Li, Frank Guerin, Chenghua Lin
DOI: https://doi.org/10.48550/arXiv.2310.17589
2024-01-29
Abstract: Data contamination in model evaluation has become increasingly prevalent with the growing popularity of large language models. It allows models to "cheat" via memorisation instead of displaying true capabilities. Therefore, contamination analysis has become a crucial part of reliable model evaluation to validate results. However, existing contamination analysis is usually conducted internally by large language model developers and often lacks transparency and completeness. This paper presents an extensive data contamination report for over 15 popular large language models across six popular multiple-choice QA benchmarks. We also introduce an open-source pipeline that enables the community to perform contamination analysis on customised data and models. Our experiments reveal varying contamination levels ranging from 1% to 45% across benchmarks, with the contamination degree increasing rapidly over time. Performance analysis of large language models indicates that data contamination does not necessarily lead to increased model metrics: while significant accuracy boosts of up to 14% and 7% are observed on contaminated C-Eval and Hellaswag benchmarks, only a minimal increase is noted on contaminated MMLU. We also find that larger models seem able to gain more advantages than smaller models on contaminated test sets.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses data contamination in the evaluation of large language models (LLMs). Specifically:

1. **Data Contamination Phenomenon**: With the proliferation of large language models, data contamination has become an increasingly serious problem. Data contamination refers to the inadvertent presence of test set samples in the training data, which allows models to "cheat" through memorization rather than true capability. This misleads the assessment of model performance, distorts comparisons between models, and undermines the utility of benchmarks.
2. **Inadequacies of Existing Contamination Analyses**: Current contamination analyses are typically conducted internally by the developers of large language models and lack transparency and completeness. For example, OpenAI's contamination study on GPT-4 covered only pre-training data and neglected subsequent fine-tuning stages; Llama-2 reported contamination statistics for only 2 of the 20+ benchmarks used in its evaluation, with unclear details on how contamination was identified.
3. **Providing Transparent Contamination Analysis**: To improve transparency and credibility, the paper provides an open-source data contamination report covering over 15 popular large language models on six common multiple-choice question-answering benchmarks. It also introduces an open-source pipeline that enables the community to conduct contamination analysis on custom data and models.

### Main Findings

1. **Variation in Contamination Levels**: Contamination levels across benchmarks range from 1% to 45%, and the degree of contamination increases rapidly over time.
2. **Impact of Contamination on Model Performance**: Contamination leads to significant accuracy gains on some benchmarks (up to 14% on C-Eval and 7% on Hellaswag), but its impact is minimal on others (e.g., MMLU).
3. **Effect of Model Size**: Larger models seem to gain a greater advantage from contaminated data, possibly due to their stronger memorization capabilities.
4. **Difference Between Input and Label Contamination**: Cases where both inputs and labels are contaminated affect model performance more than cases where only the input is contaminated.

### Methodology

1. **Detecting Contaminated Samples**: Contaminated samples are identified by checking whether test samples match completely with text in Common Crawl, a large corpus often included in the pre-training data of language models.
2. **Classification and Evaluation**: Benchmarks are divided into clean and contaminated sets, and models are evaluated on these subsets separately to assess the impact of contamination on evaluation results.
3. **Open-Source Pipeline**: An open-source pipeline is provided so that the community and third parties can conduct contamination analysis without access to the full training data.

### Conclusion

Through detailed analysis and experiments, the paper shows how prevalent data contamination is in the evaluation of large language models and how it affects results, and it provides a transparent, reproducible method for detecting and assessing contamination. These findings matter for the reliability and fairness of model evaluations.
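
The methodology summarised above (match test samples against a web corpus, then score the clean and contaminated subsets separately) can be illustrated with a small sketch. This is not the paper's pipeline. It assumes a short in-memory list of strings standing in for Common Crawl, and it treats a sample as contaminated when its normalised question appears verbatim in a document. The function names `normalize`, `classify_sample`, and `accuracy_by_subset`, the subset labels, and the normalisation step are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalisation step)."""
    return " ".join(text.lower().split())


def classify_sample(question: str, answer: str, corpus: list[str]) -> str:
    """Label one test sample as 'clean', 'input_only', or 'input_and_label'.

    'input_only': the question appears verbatim in some document.
    'input_and_label': some document contains both the question and the answer.
    """
    q, a = normalize(question), normalize(answer)
    label = "clean"
    for doc in corpus:
        d = normalize(doc)
        if q in d:
            if a in d:
                return "input_and_label"
            label = "input_only"
    return label


def accuracy_by_subset(samples: list[dict], corpus: list[str]) -> dict[str, float]:
    """Compute model accuracy separately for each contamination subset.

    Each sample is a dict with 'question', 'answer', and 'correct'
    (whether the model under evaluation answered it correctly).
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for s in samples:
        subset = classify_sample(s["question"], s["answer"], corpus)
        totals[subset] += 1
        hits[subset] += int(s["correct"])
    return {subset: hits[subset] / totals[subset] for subset in totals}
```

Comparing the accuracy reported for `clean` against the two contaminated subsets gives the kind of per-subset gap that the findings above describe. A real analysis would need a search over Common Crawl rather than a Python list, and likely a more careful matching rule.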