Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan

2024-06-21

Abstract: Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present a comprehensive survey in the field of data contamination, laying out the key issues, methodologies, and findings to date, and highlighting areas in need of further research and development. In particular, we begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors.

Computation and Language

What problem does this paper attempt to address?

This paper addresses data contamination in the training of large-scale language models (LLMs). Data contamination is the unintentional or intentional inclusion of evaluation or benchmark test data in a model's training data, which inflates the model's scores on those benchmarks. The problem is especially pronounced for LLMs trained on internet-derived corpora, because such corpora may inadvertently contain instances from evaluation benchmarks, which undermines a faithful assessment of how well the model generalizes to new tasks. Specifically, the paper focuses on the following aspects:

1. **Impact of data contamination**: Examine how data contamination affects model performance on downstream tasks, and explore the relationship between contaminated data, model memorization, and downstream task performance.
2. **Methods for detecting data contamination**: Analyze existing contamination detection methods and categorize them to highlight their focus, assumptions, strengths, and limitations.
3. **Strategies for mitigating data contamination**: Discuss mitigation strategies and offer clear guidance for future research.

Through this analysis, the paper aims to give researchers in natural language processing (NLP) an in-depth, systematic understanding of the data contamination problem, and thereby to improve the fairness and accuracy of evaluation.
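To make the notion of "training corpus overlap with evaluation benchmarks" concrete, the sketch below scores one benchmark item by the fraction of its word n-grams that also appear in a set of training documents. This is an illustrative assumption, not a method taken from the survey: the function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default `n = 3` are all hypothetical choices made for the example.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 3) -> float:
    """Fraction of the benchmark item's n-grams that occur anywhere in the corpus.

    Hypothetical helper: a value near 1.0 suggests the item may appear
    verbatim in the training data; 0.0 means no n-gram is shared.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    item = "what is the capital of france"
    leaked = ["quiz: what is the capital of france answer paris"]
    clean = ["the weather in paris is mild"]
    print(overlap_ratio(item, leaked))
    print(overlap_ratio(item, clean))
```

A surface-level check like this only catches verbatim or near-verbatim overlap; paraphrased or translated benchmark items would pass it unnoticed, which is one reason detection methods differ in their assumptions.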

Investigating Data Contamination for Pre-training Language Models

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions

Data Contamination Can Cross Language Barriers

A Taxonomy for Data Contamination in Large Language Models

An Open Source Data Contamination Report for Large Language Models

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Concerned with Data Contamination? Assessing Countermeasures in Code Language Model

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?

Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Data Contamination Through the Lens of Time

Benchmark Data Contamination of Large Language Models: A Survey

CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models

CAP: Data Contamination Detection via Consistency Amplification

Evading Data Contamination Detection for Language Models is (too) Easy

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination