Abstract: Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is potentially a major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, it then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction": a prompt consisting of the dataset name, partition type, and a random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To determine whether an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction than with those from a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with a few-shot in-context learning prompt marks multiple generated completions as exact or near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting whether an LLM is contaminated with seven datasets, containing train and test/validation partitions, when compared against manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.
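The first partition-level idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prompt wording, the function names (`guided_prompt`, `general_prompt`, `split_instance`, `rouge_l_f1`, `bootstrap_p_value`), the whitespace tokenization, and the choice of a paired bootstrap as the significance test are all assumptions made for illustration; the abstract only states that the guided prompt carries the dataset name, partition type, and a random-length initial segment, and that overlap is compared with ROUGE-L or BLEURT.

```python
import random


def guided_prompt(dataset, split, prefix):
    # Hypothetical wording: names the dataset and partition, as the abstract describes.
    return (f"You are given the first piece of an instance from the {split} split "
            f"of the {dataset} dataset. Finish the second piece exactly as it "
            f"appears in the dataset.\n\nFirst piece: {prefix}\nSecond piece:")


def general_prompt(prefix):
    # Hypothetical wording: same task, but without dataset or partition name.
    return ("Finish the second piece based on the first piece so that the two "
            f"form a single instance.\n\nFirst piece: {prefix}\nSecond piece:")


def split_instance(text, rng):
    # Cut a reference instance at a random word boundary into (prefix, suffix).
    words = text.split()
    if len(words) < 2:
        raise ValueError("need at least two words to split an instance")
    cut = rng.randint(1, len(words) - 1)
    return " ".join(words[:cut]), " ".join(words[cut:])


def lcs_len(a, b):
    # Length of the longest common subsequence of two token lists.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    # Simplified ROUGE-L F1 on whitespace tokens (no stemming or normalization).
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def bootstrap_p_value(guided_scores, general_scores, n_resamples=10000, seed=0):
    # One-sided paired bootstrap: the fraction of resamples in which guided
    # completions do NOT score higher on average than general completions.
    rng = random.Random(seed)
    diffs = [g - h for g, h in zip(guided_scores, general_scores)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples
```

In use, each sampled reference instance would be split once, both prompts would be sent to the model under test, and each completion would be scored against the held-out suffix with `rouge_l_f1`. Under this sketch, a partition would be flagged when `bootstrap_p_value` falls below a chosen threshold such as 0.05; that threshold is likewise an assumption.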

Data Contamination Calibration for Black-box LLMs

CAP: Data Contamination Detection via Consistency Amplification

Data Contamination Can Cross Language Barriers

How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

Data Contamination Through the Lens of Time

Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models

CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models

Clean Evaluations on Contaminated Visual Language Models

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

Efficient Label Contamination Attacks Against Black-Box Learning Models

Benchmark Data Contamination of Large Language Models: A Survey

A Survey of Calibration Process for Black-Box LLMs