Abstract: Recent statements about the impressive capabilities of large language models (LLMs) are usually supported by evaluating on open-access benchmarks. Considering the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leading to LLMs being more susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD requires only sampled texts to detect data contamination, by identifying the peakedness of an LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, based on the correction of an LLM's output distribution. To facilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval, for data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC metrics, and can effectively detect implicit contamination. TED reduces the performance improvements attributed to data contamination by up to 66.9% across various contamination setups. In real-world applications, we reveal that ChatGPT exhibits a high potential to suffer from data contamination on the HumanEval benchmark.

What problem does this paper attempt to address?

This paper addresses data contamination in the evaluation of large language models (LLMs). Because LLM training data is so large and drawn from so many sources, test data may be inadvertently included, and the model then performs exceptionally well on the leaked test data. This inflates the model's measured performance and undermines the credibility of the evaluation: it becomes hard to tell whether strong results reflect generalization from truly understanding the task or memorization of test data seen during training.

To address this challenge, the authors propose two methods:

1. **CDD (Contamination Detection via output Distribution)**: Identifies data contamination by detecting the peakedness of the LLM's output distribution. CDD needs only sampled text; it does not require access to the model's output probabilities or its training data.
2. **TED (Trustworthy Evaluation via output Distribution)**: Mitigates the impact of data contamination on evaluation by correcting the LLM's output distribution. TED is designed to reduce the performance gains that contamination produces during evaluation, thereby giving more reliable results.

The authors also construct two new benchmarks, **DetCon** and **ComiEval**, for the data contamination detection and contamination mitigation evaluation tasks, respectively. The experimental results show that CDD achieves average relative improvements of 21.8%-30.2% over other contamination detection approaches in Accuracy, F1 Score, and AUC, and that TED reduces the performance improvements attributed to data contamination by up to 66.9% across various contamination setups.
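
The CDD idea described above (judging contamination from sampled text alone, by how sharply the samples concentrate) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes peakedness is measured as the fraction of samples lying within a small token-level edit distance of a reference output (for example, a greedy decode). The function names, the whitespace tokenization, and the `alpha` and `xi` values are all hypothetical choices made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def peakedness(samples, reference, alpha=0.05):
    """Fraction of samples within alpha * max_len edits of the reference.

    Assumption: tokens are whitespace-separated, and max_len is the longest
    token sequence among the samples and the reference.
    """
    if not samples:
        raise ValueError("need at least one sampled output")
    ref_tokens = reference.split()
    tokenized = [s.split() for s in samples]
    max_len = max(len(t) for t in tokenized + [ref_tokens])
    radius = alpha * max_len
    close = sum(1 for t in tokenized if edit_distance(t, ref_tokens) <= radius)
    return close / len(samples)


def looks_contaminated(samples, reference, alpha=0.05, xi=0.01):
    """Flag a prompt when peakedness exceeds the (hypothetical) threshold xi."""
    return peakedness(samples, reference, alpha) > xi
```

A model that has memorized a benchmark item tends to return near-identical text across samples, so `peakedness` approaches 1; diverse outputs push it toward 0. In practice both thresholds would need tuning on data with known contamination labels.
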
The paper also shows that, in real-world use, ChatGPT has a relatively high risk of data contamination on the HumanEval benchmark, and that this contamination may become more serious over time. Using CDD and TED, data contamination can be detected and mitigated more effectively, improving the credibility of model evaluation.
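
The summary does not spell out how TED corrects the output distribution, so the following is only one plausible reading, sketched under stated assumptions: drop sampled outputs that sit within a small edit distance of a reference output (the suspected memorized peak), drop exact duplicates, and score only what remains. The names `corrected_samples` and `corrected_pass_rate`, the `is_correct` callback, and the `tau` default are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corrected_samples(samples, reference, tau=2):
    """Remove duplicates and samples within tau token edits of the reference.

    Assumption: whitespace tokenization; tau is an illustrative value.
    """
    ref_tokens = reference.split()
    seen = set()
    kept = []
    for s in samples:
        if s in seen:
            continue  # exact duplicate of an earlier sample
        seen.add(s)
        if edit_distance(s.split(), ref_tokens) <= tau:
            continue  # too close to the suspected memorized output
        kept.append(s)
    return kept


def corrected_pass_rate(samples, reference, is_correct, tau=2):
    """Share of the remaining samples that the caller's checker accepts."""
    kept = corrected_samples(samples, reference, tau)
    if not kept:
        return 0.0  # nothing survives the correction
    return sum(1 for s in kept if is_correct(s)) / len(kept)
```

The intent is that a score computed this way depends less on verbatim recall of a leaked answer. Returning 0.0 when no sample survives is a conservative design choice of this sketch, not something the summary specifies.
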

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
