Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan
2024-06-21
Abstract: Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present a comprehensive survey in the field of data contamination, laying out the key issues, methodologies, and findings to date, and highlighting areas in need of further research and development. In particular, we begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors.
Computation and Language
What problem does this paper attempt to address?
This paper addresses data contamination in the training of large-scale language models (LLMs). Data contamination is the unintentional or intentional inclusion of evaluation or benchmark test data in the training data, which inflates the model's scores on those benchmarks. The problem is especially prominent for LLMs trained on Internet-derived corpora: such corpora may inadvertently contain instances from evaluation benchmarks, which distorts any assessment of how well the model generalizes to new tasks. Specifically, the paper focuses on the following aspects:

1. **Impact of data contamination**: Examine how data contamination affects model performance on downstream tasks, and explore the relationship between contaminated data, the model's memorization, and downstream task performance.
2. **Methods for detecting data contamination**: Analyze existing detection methods, categorizing them and discussing their focus, assumptions, strengths, and limitations.
3. **Strategies for mitigating data contamination**: Discuss mitigation strategies and provide clear guidance for future research.

Through this analysis, the paper aims to give researchers in natural language processing (NLP) an in-depth, systematic understanding of the data contamination problem, thereby improving the fairness and accuracy of evaluation.
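To make the notion of overlap between a training corpus and a benchmark concrete, here is a minimal sketch of a word n-gram overlap check. It is an illustration only, not a method taken from the survey: the function names `ngrams` and `overlap_ratio`, the whitespace-and-punctuation tokenization, and the default `n = 8` are all assumptions chosen for the example.

```python
import re
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(example: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the benchmark example's n-grams that also occur in the corpus.

    A value near 1.0 suggests the example appears (nearly) verbatim in the
    training data; a value of 0.0 means no shared n-gram was found.
    """
    example_grams = ngrams(example, n)
    if not example_grams:
        # Example is shorter than n tokens: no evidence either way.
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(example_grams & corpus_grams) / len(example_grams)
```

Calling `overlap_ratio(benchmark_item, training_docs)` on each benchmark item and flagging high ratios gives a rough first-pass signal. A check of this kind only catches near-verbatim overlap and needs access to the training corpus, so paraphrased or translated benchmark items would go undetected.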