Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?

Aaditya K. Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, Gergely Szilvasy, Dieuwke Hupkes
2024-11-06
Abstract: Hampering the interpretation of benchmark scores, evaluation data contamination has become a growing concern in the evaluation of LLMs, and an active area of research studies its effects. While evaluation data contamination is easily understood intuitively, it is surprisingly difficult to define precisely which samples should be considered contaminated and, consequently, how it impacts benchmark scores. We propose that these questions should be addressed together and that contamination metrics can be assessed based on whether models benefit from the examples they mark contaminated. We propose a novel analysis method called ConTAM, and show with a large scale survey of existing and novel n-gram based contamination metrics across 13 benchmarks and 7 models from 2 different families that ConTAM can be used to better understand evaluation data contamination and its effects. We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales. We also find that considering only the longest contaminated substring provides a better signal than considering a union of all contaminated substrings, and that doing model and benchmark specific threshold analysis greatly increases the specificity of the results. Lastly, we investigate the impact of hyperparameter choices, finding that, among other things, both using larger values of n and disregarding matches that are infrequent in the pre-training data lead to many false negatives. With ConTAM, we provide a method to empirically ground evaluation data contamination metrics in downstream effects. With our exploration, we shed light on how evaluation data contamination can impact LLMs and provide insight into the considerations important when doing contamination analysis. We end our paper by discussing these in more detail and providing concrete suggestions for future work.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to measure evaluation data contamination in large language models (LLMs) and how to assess its impact on benchmark scores. Evaluation data contamination refers to the accidental inclusion of samples from an evaluation benchmark in the pre-training corpus. This form of "training on the test set" makes benchmark scores difficult to interpret. The paper points out that although contamination is easy to understand intuitively, it is hard to define precisely which samples should count as contaminated and how much contamination affects benchmark scores. To address both questions together, the authors propose a new analysis method, ConTAM (Contamination Threshold Analysis Method). By applying existing and newly proposed n-gram-based contamination metrics to 13 benchmarks and 7 models of different scales from 2 model families, they show how ConTAM can be used to judge the effectiveness of contamination metrics, and they report specific findings and suggestions.

### Main Conclusions:

1. **Underestimation of the Impact of Contamination**: The impact of evaluation data contamination has been underestimated in many well-known LLM releases, which may be due to false negatives in the selected contamination metrics.
2. **Longest Contaminated Substring is Superior to All Matched Substrings**: In most cases, considering only the longest contaminated substring yields a more meaningful estimated performance gain (EPG) than considering the union of all matched substrings.
3. **Smaller n-values are Better**: For almost all benchmarks considered, smaller values of n work better, and even a single occurrence in the pre-training data can affect the model.
4. **The Impact of Contamination Varies with Model Scale**: Larger models make better use of contamination when there is still room for performance improvement.
5. **Model-specific Threshold Selection**: Finding the most appropriate contamination metric requires selecting thresholds separately for each model.

### Method Overview:

- **Contamination Metrics**: The paper studies four contamination metrics: NGRAM-MATCH, TOKEN-MATCH, TOKEN-EXTEND, and LONGEST-MATCH. Each computes a per-sample contamination score in a different way, ranging from 0 (completely uncontaminated) to 1 (completely contaminated).
- **Estimated Performance Gain (EPG)**: The impact of contamination is quantified as the difference between the model's performance on the complete benchmark and its performance on the subset marked as "clean".
- **Threshold Selection**: A z-score method is used to select the contamination threshold, which reduces false positives and improves the reliability of the results.
- **ConTAM Graph**: Contamination metrics are compared by plotting EPG against the percentage of data marked as contaminated.

### Conclusion:

Through ConTAM, the paper provides a systematic way to evaluate and understand the impact of evaluation data contamination, along with findings and suggestions for future research and practice.
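
As a rough illustration of the per-sample contamination scores and the EPG described in the method overview, the following Python sketch may help. It is not the authors' implementation: the function names, the whitespace tokenization in the usage note, the simplified "union" and "longest run" scores, and the rule that a sample counts as clean when its score is at or below the threshold are all assumptions made for this example. The z-score threshold selection is not reproduced here.

```python
from typing import List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> Set[NGram]:
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def covered_mask(sample: Sequence[str], corpus_ngrams: Set[NGram], n: int) -> List[bool]:
    """Mark each sample token that lies inside an n-gram also found in the corpus."""
    mask = [False] * len(sample)
    for i in range(len(sample) - n + 1):
        if tuple(sample[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def union_score(mask: Sequence[bool]) -> float:
    """Fraction of tokens covered by any match (assumed stand-in for a union-style metric)."""
    return sum(mask) / len(mask) if mask else 0.0


def longest_run_score(mask: Sequence[bool]) -> float:
    """Fraction of tokens in the longest contiguous matched run
    (assumed stand-in for a longest-substring-style metric)."""
    best = run = 0
    for hit in mask:
        run = run + 1 if hit else 0
        best = max(best, run)
    return best / len(mask) if mask else 0.0


def estimated_performance_gain(
    correct: Sequence[int], scores: Sequence[float], threshold: float
) -> float:
    """EPG = accuracy on the full benchmark minus accuracy on the clean subset.

    Assumption: a sample is clean when its contamination score <= threshold.
    """
    if not correct:
        raise ValueError("benchmark is empty")
    clean = [c for c, s in zip(correct, scores) if s <= threshold]
    if not clean:
        raise ValueError("no clean samples at this threshold")
    return sum(correct) / len(correct) - sum(clean) / len(clean)
```

For usage, one could build `corpus_ngrams = ngrams(corpus_text.split(), n)` from a hypothetical `corpus_text` string, score each benchmark sample with `longest_run_score(covered_mask(sample.split(), corpus_ngrams, n))`, and then evaluate `estimated_performance_gain` over a range of thresholds. Pairing each EPG value with the percentage of samples marked contaminated at that threshold gives the points of a ConTAM-style plot.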