Abstract: Recent claims about the impressive capabilities of large language models (LLMs) are usually supported by evaluation on open-access benchmarks. Given the vast size and wide-ranging sources of LLMs' training data, that data could explicitly or implicitly include test data, making LLMs susceptible to data contamination. However, the opacity of training data, black-box access to models, and the rapid growth of synthetic training data make detecting and mitigating data contamination in LLMs a significant challenge. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD requires only sampled texts to detect data contamination, which it does by identifying the peakedness of an LLM's output distribution. To mitigate the impact of data contamination on evaluation, we also present TED: Trustworthy Evaluation via output Distribution, which is based on correcting the LLM's output distribution. To facilitate this study, we introduce two benchmarks, DetCon and ComiEval, for the data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC, and can effectively detect implicit contamination. TED substantially mitigates performance improvements attributed to data contamination, by up to 66.9%, across various contamination setups. In real-world applications, we find that ChatGPT exhibits a high potential to suffer from data contamination on the HumanEval benchmark.
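As a rough illustration of what "peakedness of the output distribution" could mean in practice, the sketch below scores how tightly a set of sampled outputs clusters around a reference output for the same prompt. It is not the paper's implementation: the token-level edit distance, the whitespace tokenization, the function names, and the `alpha` and `threshold` values are all assumptions made for this example.

```python
from typing import List


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def peakedness(reference: str, samples: List[str], alpha: float = 0.05) -> float:
    """Fraction of samples within alpha * max_len edits of the reference.

    `alpha` is a hypothetical tolerance, not a value taken from the paper.
    """
    if not samples:
        return 0.0
    ref_tokens = reference.split()
    sample_tokens = [s.split() for s in samples]
    max_len = max([len(ref_tokens)] + [len(t) for t in sample_tokens])
    radius = alpha * max_len
    close = sum(1 for t in sample_tokens
                if edit_distance(ref_tokens, t) <= radius)
    return close / len(samples)


def looks_contaminated(reference: str, samples: List[str],
                       alpha: float = 0.05, threshold: float = 0.01) -> bool:
    """Flag a prompt when its sampled outputs are unusually concentrated.

    `threshold` is an assumed cutoff chosen only for illustration.
    """
    return peakedness(reference, samples, alpha) > threshold
```

Here `reference` would typically be a deterministic (for example, greedy) output and `samples` a set of outputs drawn at a nonzero temperature for the same prompt. A model that has memorized a benchmark item tends to reproduce nearly the same text across samples, which pushes the score toward 1.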

IterClean: an Iterative Data Cleaning Framework with Large Language Models

Data Cleaning Using Large Language Models

CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models

ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models

LLM-Assisted Code Cleaning For Training Accurate Code Generators

A Hybrid Data Cleaning Framework Using Markov Logic Networks

AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models

Automated Data Curation for Robust Language Model Fine-Tuning

Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse

BoostClean: Automated Error Detection and Repair for Machine Learning

Batchwise Probabilistic Incremental Data Cleaning

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models

RetClean: Retrieval-Based Data Cleaning Using Foundation Models and Data Lakes

Large Language Model Can Continue Evolving From Mistakes

FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for Large Language Models' Training?