Data Contamination Through the Lens of Time

Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley
2023-10-17
Abstract: Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.
Computation and Language
### What problems does this paper attempt to solve?

This paper analyzes the data contamination problem that arises when large language models (LLMs) are evaluated on public benchmark datasets. Specifically, the researchers are concerned with:

1. **Data contamination**: When an LLM is exposed during training to examples that are similar or identical to those in an evaluation benchmark, its measured performance on that benchmark may be overestimated. This not only distorts the assessment of model performance but may also bring security and privacy risks.
2. **Memorization**: An LLM may memorize specific examples during training and reproduce them directly during evaluation, rather than solving problems through genuine reasoning. This also distorts the assessment of the model's actual capabilities.

To study these issues, the authors use the known training cutoff dates of the GPT series models as a natural experiment and analyze two longitudinal code/math problem-solving datasets, Codeforces and Project Euler. In both datasets, problems were released over a long span of time, some before the GPT training cutoff date and some after. The authors can therefore separate the problems the model may have been exposed to from those it has not, and so assess the impact of data contamination and memorization more accurately.

### Main contributions of the research

- **First large-scale longitudinal analysis**: The authors conduct the first large-scale longitudinal analysis of data contamination and memorization, using a natural experiment to compare LLM performance on problems from different time periods.
- **Empirical findings**: For problems released before the GPT training cutoff date, LLM performance shows a significant positive correlation with the problems' popularity on GitHub. For problems released after the cutoff date, this correlation disappears, which indicates significant data contamination.
- **Open-source code and datasets**: To support follow-up research, the authors open-source all the code and raw results used to construct the datasets and conduct the analysis.

### Method overview

The analysis is built from the following components:

1. **Independent variables**:
   - **GitHub Presence**: How often a problem appears on GitHub, used as a proxy for its public availability.
   - **Difficulty**: The difficulty of a problem.
   - **Problem released post-cutoff**: A boolean variable indicating whether a problem was released after the training cutoff date.
2. **Dependent variables**:
   - **Problem-level pass rate**: Whether the generated code correctly passes the test cases.
   - **Title reproduction**: Whether the LLM can correctly generate the title of a problem.
   - **Tag reproduction**: Whether the LLM can correctly generate the descriptive tags of a problem.
3. **Regression analysis**: The authors use regression analysis to estimate the influence of each variable on LLM performance, with particular attention to how that influence changes before and after the training cutoff date.

### Main findings

- **Influence of GitHub Presence**: For problems released before the training cutoff date, GitHub Presence is significantly and positively correlated with the LLM's pass rate. For problems released after the cutoff date, this correlation disappears, which indicates significant data contamination and/or memorization.
- **Influence of difficulty**: Both before and after the training cutoff date, difficulty is negatively correlated with pass rate, but the negative correlation is weaker for problems released after the cutoff.

Overall, this study shows that current LLM evaluation benchmarks may suffer from serious data contamination, which undermines accurate assessment of a model's true capabilities. The researchers call for stricter and more transparent practices when releasing evaluation benchmarks in the future, to ensure that evaluation is fair and accurate.
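
The regression setup described under "Method overview" can be illustrated with a minimal sketch. It is not the authors' code: the `Problem` record, the `CUTOFF` date, the helper names, and the toy data are all hypothetical, and the summary does not specify the exact regression model. The sketch assumes a single-feature logistic regression of pass/fail on raw GitHub presence, fitted separately for problems released before and after the cutoff (difficulty is omitted for brevity). The toy data is synthetic and constructed so that presence predicts passing only in the pre-cutoff group.

```python
import math
from dataclasses import dataclass
from datetime import date

# Illustrative cutoff only; not a value taken from the summary above.
CUTOFF = date(2021, 9, 1)


@dataclass
class Problem:
    released: date
    github_presence: int  # hypothetical count of GitHub mentions
    passed: bool          # did the generated code pass the test cases?


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(pass) = sigmoid(b0 + b1 * x) by gradient ascent
    on the mean log-likelihood. Returns (b0, b1)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def slopes_by_period(problems, cutoff):
    """Return the fitted GitHub-presence slope for the pre-cutoff
    and post-cutoff subsets of problems."""
    slopes = {}
    for label, keep_post in (("pre", False), ("post", True)):
        subset = [p for p in problems if (p.released > cutoff) == keep_post]
        xs = [float(p.github_presence) for p in subset]
        ys = [1.0 if p.passed else 0.0 for p in subset]
        slopes[label] = fit_logistic(xs, ys)[1]
    return slopes


# Synthetic toy data (not real results): presence tracks passing
# before the cutoff and is unrelated to passing after it.
TOY_PROBLEMS = (
    [Problem(date(2020, 1, 1), x, y)
     for x, y in zip(range(6), [False, False, True, False, True, True])]
    + [Problem(date(2022, 1, 1), x, y)
       for x, y in zip(range(4), [True, False, False, True])]
)

slopes = slopes_by_period(TOY_PROBLEMS, CUTOFF)
print(slopes)
```

On this toy data the pre-cutoff slope is positive and the post-cutoff slope is zero, which mirrors the qualitative pattern the summary reports. A real analysis would also include difficulty as a covariate and report statistical significance, which this sketch does not attempt.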