Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Shahriar Golchin, Mihai Surdeanu
2024-02-22
Abstract: Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, our approach then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction compared to a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with few-shot in-context learning prompt marks multiple generated completions as exact/near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.
Computation and Language, Artificial Intelligence, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses data contamination in large language models (LLMs): the presence of test data from downstream tasks in an LLM's training data, which can distort measurements of the model's real effectiveness on other tasks. The authors propose a simple yet effective method for identifying such contamination.

### Method Overview

1. **Instance-Level Contamination Detection**:
   - **Guided Instruction**: The LLM receives a prompt containing the dataset name, the partition type, and a random-length initial segment of a reference instance, and is asked to complete the instance.
   - **Evaluation Method**: If the LLM's output exactly or nearly matches the latter segment of the reference instance, the instance is flagged as contaminated.
2. **Partition-Level Contamination Detection**:
   - **Average Overlap Score**: If the average overlap with the reference instances (measured with ROUGE-L or BLEURT) is statistically significantly higher for completions produced under guided instruction than for completions produced under a general instruction, which omits the dataset and partition name, the partition is considered contaminated.
   - **GPT-4 Classifier**: If a classifier based on GPT-4 with a few-shot in-context learning (ICL) prompt marks at least one generated completion as an exact match, or at least two generated completions as near-exact matches, the partition is considered contaminated.

### Main Contributions

1. **Proposed a novel and cost-effective data contamination detection method**, suitable for scenarios that lack access to pre-training data and have limited computational resources.
2. **Evaluated the proposed method in 28 different scenarios**, involving two state-of-the-art LLMs (GPT-3.5 and GPT-4) and seven datasets spanning classification, summarization, and natural language inference tasks. The best method's accuracy ranges from 92% to 100% when compared with manual evaluation by human experts.
3. **Found evidence that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets**, supporting the view that data contamination is a serious issue.

### Related Work

Despite the importance of data contamination, research on it is relatively scarce, and existing studies mainly focus on data memorization. Early work on contamination in LLMs primarily relied on high-order n-grams to detect overlap between pre-training data and evaluation datasets. These methods typically require access to the pre-training data or substantial computational resources, which limits their applicability. The method proposed in this paper aims to overcome these limitations: it detects data contamination without access to pre-training data and with limited computational resources.
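
The two detection levels from the Method Overview can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`build_prompts`, `rouge_l_f1`, `partition_flagged`) are hypothetical, the prompt wording is illustrative rather than the paper's exact text, the label strings `"exact"` and `"near-exact"` are assumed classifier outputs, and `rouge_l_f1` is a simplified whitespace-token ROUGE-L rather than an official scorer. The statistical significance test required by the overlap-based rule is not implemented here.

```python
from collections import Counter


def build_prompts(dataset, split, first_piece):
    """Return (guided, general) instructions for one reference instance.

    The guided prompt names the dataset and partition; the general one does not.
    Wording is illustrative only (an assumption, not the paper's exact prompts).
    """
    guided = (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Complete the second piece exactly as it "
        f"appears in the dataset.\n\nFirst piece: {first_piece}\nSecond piece:"
    )
    general = (
        "Complete the second piece based on the first piece so that together "
        f"they form one instance.\n\nFirst piece: {first_piece}\nSecond piece:"
    )
    return guided, general


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def partition_flagged(labels):
    """Classifier-based rule: at least one exact or at least two near-exact matches."""
    counts = Counter(labels)
    return counts["exact"] >= 1 or counts["near-exact"] >= 2
```

In use, each completion generated under the guided and the general prompt would be scored against the latter segment of its reference with `rouge_l_f1`, and the two sets of scores compared for a statistically significant difference; `partition_flagged` would instead take one label per completion from the few-shot classifier.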