NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre
2023-10-27
Abstract: In this position paper, we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model in a target benchmark and associated task with respect to their non-contaminated counterparts. The consequences can be very harmful, with wrong scientific conclusions being published while other correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers with conclusions that are compromised by data contamination.
Computation and Language
What problem does this paper attempt to address?
The paper addresses data contamination in NLP evaluation: the training data of large language models (LLMs) may include the very benchmark datasets later used to evaluate them. In the worst case, an LLM is trained on the test split of a benchmark and then evaluated on that same benchmark, which overestimates its performance on the benchmark and the associated task relative to non-contaminated models. Because the extent of the problem is unknown and not straightforward to measure, contamination distorts comparisons between models and can lead to wrong scientific conclusions being published while correct ones are discarded. The paper therefore defines different levels of data contamination and calls for a community effort: developing automatic and semi-automatic measures to detect when benchmark data was exposed to a model, flagging papers whose conclusions are compromised by contamination, and establishing a registry of data contamination cases.
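To make the idea of an automatic contamination measure concrete, here is a minimal sketch of one common heuristic: word n-gram overlap between benchmark test examples and a training corpus. This is an illustrative assumption, not the method proposed in the paper. The function names, the n-gram size of 8, and the 0.5 threshold are all hypothetical choices, and the approach only works when the training corpus is available for inspection, which is often not the case for closed models.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(example, corpus_ngrams, n=8):
    """Fraction of the example's n-grams that also occur in the corpus."""
    grams = ngrams(example, n)
    if not grams:
        # Example is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)


def flag_contaminated(test_examples, corpus_docs, n=8, threshold=0.5):
    """Return the test examples whose overlap ratio reaches the threshold.

    n and threshold are illustrative values (assumptions), not taken
    from the paper.
    """
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [
        ex for ex in test_examples
        if overlap_ratio(ex, corpus_ngrams, n) >= threshold
    ]


if __name__ == "__main__":
    corpus = ["The quick brown fox jumps over the lazy dog near the river bank today."]
    tests = [
        "the quick brown fox jumps over the lazy dog near the river",
        "A completely unrelated sentence about evaluating language models on new held out data.",
    ]
    print(flag_contaminated(tests, corpus))
```

A check like this only catches verbatim or near-verbatim leakage of test text. Paraphrased or reformatted benchmark data would pass unnoticed, which is one reason the extent of contamination is hard to measure.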