Abstract: The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large-scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in currently available datasets and models. The goal of the shared task and associated database is to help the community understand the extent of the problem and to help researchers avoid reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pull requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available on the platform. The platform remains online and open to contributions from the community.

What problem does this paper attempt to address?

The paper addresses data contamination in natural language processing (NLP): the inadvertent inclusion of evaluation data in pre-training corpora. Contamination can bias the measured performance of large language models (LLMs) on specific tasks or benchmarks, so that the results no longer reflect the models' true generalization ability. By organizing a shared task, the paper collected evidence of data contamination in currently available datasets and models, aiming to help the research community understand the severity of the problem and to help researchers avoid reporting evaluation results on known contaminated resources. Specifically, the paper covers the following aspects:

1. **Definition and Background**: The paper first defines data contamination as the inclusion of evaluation data in the pre-training corpus, which makes evaluation results inaccurate. As model sizes and data volumes grow and large-scale web crawling becomes standard, contamination has become more common and harder to detect.
2. **Methodology**: The paper introduces the methods used to collect contamination evidence, including data-based methods (such as string-matching techniques) and model-based methods (such as membership-inference attacks). These methods detect whether evaluation data is included in the pre-training data.
3. **Database Construction**: The paper describes how a structured, centralized public database was built to collect and organize contamination evidence from the community. Each record in the database reports the proportion of contaminated data and identifies the contamination source and the evaluation dataset.
4. **Data Analysis**: The paper analyzes the entries in the database, including the contamination proportions for different task types, the distribution of contaminated datasets and models, and the distribution of contamination over time. These analyses help characterize the current state and trend of data contamination in NLP.
5. **Conclusions and Prospects**: Finally, the paper summarizes the impact of data contamination on NLP research, emphasizes the importance of continuously monitoring and updating the contamination database, and encourages community members to keep contributing new contamination evidence.

In short, through systematic data collection and analysis, the paper documents the severity of data contamination in NLP and gives the research community a tool to better understand and address it.
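As a rough illustration of the data-based, string-matching approach mentioned under Methodology, the sketch below flags an evaluation example as contaminated when it shares at least one word n-gram with a corpus, then reports the contaminated proportion. It is not the shared task's actual tooling: the function names, the simple word tokenization, and the default n-gram size of 8 are all assumptions chosen for the example.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(eval_examples, corpus_docs, n=8):
    """Fraction of evaluation examples sharing at least one n-gram with the corpus.

    Hypothetical helper: a real pipeline would index the corpus at scale
    instead of holding every n-gram in memory.
    """
    if not eval_examples:
        return 0.0
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    hits = sum(1 for ex in eval_examples if ngrams(ex, n) & corpus_ngrams)
    return hits / len(eval_examples)


if __name__ == "__main__":
    corpus = ["The quick brown fox jumps over the lazy dog near the river bank."]
    eval_set = [
        "A quick brown fox jumps over the fence.",
        "Completely unrelated sentence about data.",
    ]
    # A small n is used here only because the toy texts are short.
    print(contaminated_fraction(eval_set, corpus, n=4))
```

Exact n-gram matching of this kind misses paraphrased or translated copies of evaluation data, which is one reason model-based methods are reported alongside it.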

Data Contamination Report from the 2024 CONDA Shared Task
