Data Contamination Report from the 2024 CONDA Shared Task

Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen, Luca D'Amico-Wong, Melissa Dell, Run-Ze Fan, Shahriar Golchin, Yucheng Li, Pengfei Liu, Bhavish Pahwa, Ameya Prabhu, Suryansh Sharma, Emily Silcock, Kateryna Solonko, David Stap, Mihai Surdeanu, Yu-Min Tseng, Vishaal Udandarao, Zengzhi Wang, Ruijie Xu, Jinglin Yang
2024-08-04
Abstract: The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in the pre-training corpora used to train large-scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in currently available datasets and models. The goal of the shared task and associated database is to help the community understand the extent of the problem and to help researchers avoid reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pull requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available on the platform. The platform remains online and open to contributions from the community.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses data contamination in natural language processing (NLP): the inadvertent inclusion of evaluation data in pre-training corpora. Contamination can bias the measured performance of large language models (LLMs) on specific tasks or benchmarks, so that reported results no longer reflect the models' true generalization ability. Through a shared task, the paper collected evidence of data contamination in currently available datasets and models, aiming to help the research community understand the severity of the problem and to help researchers avoid reporting evaluation results on known contaminated resources. Specifically, the paper covers the following aspects:

1. **Definition and Background**: The paper first defines data contamination as the inclusion of evaluation data in the pre-training corpus, which makes evaluation results inaccurate. As model size and data volume grow, and as large-scale web crawls become standard, contamination has become more common and harder to detect.
2. **Methodology**: The paper introduces the methods used to collect contamination evidence, including data-based methods (such as string matching) and model-based methods (such as membership inference attacks). These methods are used to detect whether evaluation data is present in the pre-training data.
3. **Database Construction**: The paper describes the construction of a structured, centralized public database that collects and organizes contamination evidence from the community. Each record reports the proportion of contaminated data and identifies the contamination source and the evaluation dataset.
4. **Data Analysis**: The paper analyzes the database entries, including contamination proportions across task types, the distribution of contaminated datasets and models, and how the contaminated data is distributed over time. These analyses help characterize the current state and trend of data contamination in NLP.
5. **Conclusions and Prospects**: Finally, the paper summarizes the impact of data contamination on NLP research, emphasizes the importance of continuously monitoring and updating the contamination database, and encourages community members to keep contributing new contamination evidence.

In short, through systematic data collection and analysis, the paper documents the severity of data contamination in NLP and provides a tool that helps the research community understand and address it.
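To make the data-based detection idea more concrete, the following is a minimal sketch of string matching via word n-gram overlap between evaluation examples and a corpus. It is an illustration only, not the procedure used by the shared task or any of its contributors: the function names `ngrams` and `contaminated_fraction`, the whitespace tokenization, and the `n` and `threshold` values are all assumptions chosen for the example.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased, whitespace-tokenized word n-grams."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(
    eval_examples: List[str],
    corpus_docs: List[str],
    n: int = 8,              # assumed n-gram size, not taken from the paper
    threshold: float = 0.5,  # assumed overlap ratio needed to flag an example
) -> float:
    """Fraction of evaluation examples whose n-grams largely occur in the corpus."""
    if not eval_examples:
        return 0.0

    # Index every n-gram that appears anywhere in the corpus.
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)

    flagged = 0
    for example in eval_examples:
        grams = ngrams(example, n)
        if not grams:
            # Too short to form an n-gram: counted as not flagged.
            continue
        overlap = len(grams & corpus_ngrams) / len(grams)
        if overlap >= threshold:
            flagged += 1

    return flagged / len(eval_examples)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    evals = [
        "The quick brown fox jumps over the lazy dog",
        "completely unrelated sentence about data contamination in language models today",
    ]
    print(contaminated_fraction(evals, corpus, n=4))
```

A real pre-training corpus would not fit in an in-memory set, so practical tools typically rely on suffix arrays, Bloom filters, or search indexes instead; the resulting fraction corresponds loosely to the per-record contamination proportion described above.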