Abstract: Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper, we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named Testset Slot Guessing (TS-Guessing), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.
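
The option-masking variant of TS-Guessing described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `MultipleChoiceItem` type, the prompt wording, and the whitespace/case normalization used for exact match are hypothetical and are not taken from the paper. The sketch only shows the general shape of the idea: hide one wrong option, ask the model for it, and score the replies by exact match.

```python
import random
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    # Hypothetical container for one benchmark question.
    question: str
    options: dict[str, str]  # option label -> option text
    answer: str              # label of the correct option


def build_masked_prompt(item, rng):
    """Mask one wrong option; return (prompt, hidden option text)."""
    wrong_labels = sorted(label for label in item.options if label != item.answer)
    masked_label = rng.choice(wrong_labels)
    lines = [
        "Fill in the [MASK] in the options below.",  # assumed wording
        f"Question: {item.question}",
        "Options:",
    ]
    for label, text in item.options.items():
        shown = "[MASK]" if label == masked_label else text
        lines.append(f"{label}: {shown}")
    lines.append(f"Reply with the content of option {masked_label} only.")
    return "\n".join(lines), item.options[masked_label]


def normalize(text):
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def exact_match_rate(predictions, references):
    """Share of predictions equal to their reference after normalization."""
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

In use, each prompt would be sent to the model under test and the replies collected into `predictions`. A model that has never seen the test set has little basis for reproducing a wrong option verbatim, so a high exact match rate is a signal worth investigating rather than proof of contamination.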

### What problem does this paper attempt to solve?

This paper addresses potential data contamination in the evaluation benchmarks of large language models (LLMs). Specifically, the researchers are concerned with:

1. **The impact of data contamination on model performance evaluation**: Recent observations show a gap between high benchmark scores and the actual performance of models, which raises concerns that evaluation benchmarks are contaminated. The problem is especially acute for closed-source models and some open-source models, whose training data is not transparent.
2. **Methods for detecting data contamination**: The authors propose two methods for detecting data contamination in open-source and proprietary LLMs:
   - **Retrieval-based system**: Identifies possible contamination by exploring the overlap between the evaluation benchmark and the pretraining corpus.
   - **Testset Slot Guessing (TS-Guessing) protocol**: Hides a wrong option in a multiple-choice question, or an unlikely word in an evaluation example, and asks the model to fill in the blank. A correct guess suggests the model has been exposed to the benchmark data.
3. **Evaluating the reliability of existing benchmarks**: The researchers use these methods to reveal whether widely used evaluation benchmarks are potentially contaminated, and they emphasize the need for more robust evaluation methods and benchmarks.

### Main findings

- On the MMLU benchmark, ChatGPT and GPT-4 guessed the missing options with exact match rates of 52% and 57%, respectively.
- A fully contaminated ChatGPT reached an exact match rate close to 100% on MMLU, further validating the sensitivity of the method in detecting data contamination.
- There is no significant difference in TS-Guessing performance among different versions of LLMs from the same company. For example, GPT-4 improved by only 1% over ChatGPT.

### Conclusion

The authors call on the research community to pay attention to and address data contamination in evaluation benchmarks, especially once those benchmarks become publicly available. The results add evidence that widely used evaluation benchmarks may be contaminated and emphasize the importance of developing more reliable evaluation methods.
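
The overlap check behind the retrieval-based system mentioned above can be approximated with a small n-gram comparison. This is a simplified stand-in under stated assumptions, not the authors' system: the function names, the whitespace tokenization, and the choice of trigram overlap as the score are all hypothetical, and a search over a real pretraining corpus would need a proper index rather than a linear scan.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(example, document, n=3):
    """Fraction of the example's n-grams that also occur in the document."""
    example_grams = ngrams(example, n)
    if not example_grams:
        return 0.0  # example shorter than n tokens: nothing to compare
    return len(example_grams & ngrams(document, n)) / len(example_grams)


def top_overlaps(example, corpus, k=3, n=3):
    """Return up to k (score, document index) pairs, highest score first."""
    scored = [(overlap_score(example, doc, n), idx) for idx, doc in enumerate(corpus)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]
```

A benchmark example whose best-scoring document has an overlap close to 1.0 is a candidate for manual inspection. The threshold at which overlap counts as contamination is a judgment call that this sketch does not settle.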

Investigating Data Contamination in Modern Benchmarks for Large Language Models
