Benchmark Data Contamination of Large Language Models: A Survey

Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi
2024-06-07
Abstract: The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.
Computation and Language
What problem does this paper attempt to address?
This paper addresses Benchmark Data Contamination (BDC) in large language models (LLMs). When an LLM unintentionally encounters evaluation benchmark data during training, its benchmark scores become distorted and the evaluation becomes unreliable. The result can be an overestimate of the model's capabilities, which casts doubt on its reliability in real-world applications. The paper reviews BDC in LLM evaluation, explores alternative evaluation methods that mitigate the risks of traditional benchmarks, and analyzes the challenges and future directions in mitigating BDC risks. Structurally, it covers the background of LLMs, the definition of BDC, and its sources and impacts. It then describes the problems BDC may pose in different NLP tasks, such as code generation, machine translation, question answering, and sentiment analysis. Finally, it outlines techniques for detecting BDC, categorizing them into matching-based and comparison-based methods. In summary, the paper offers a comprehensive investigation of BDC, aiming to foster understanding of the problem and provide insights for addressing it.
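To make the matching-based category more concrete, here is a minimal sketch of one common form of matching: word n-gram overlap between a benchmark item and a training document. This is an illustration under stated assumptions, not the survey's own method. The function names, the whitespace tokenization, and the choice of n = 3 are all hypothetical choices made for the example.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=3):
    """Fraction of the benchmark item's n-grams that also occur in the training document.

    Illustrative only: a ratio near 1.0 suggests the item may appear verbatim
    in the training text; a ratio near 0.0 suggests no surface-level match.
    """
    bench = ngrams(benchmark_item, n)
    if not bench:
        # Item is shorter than n tokens, so there is nothing to match.
        return 0.0
    train = ngrams(training_doc, n)
    return len(bench & train) / len(bench)


if __name__ == "__main__":
    item = "What is the capital of France"
    leaked = "Quiz: what is the capital of France ? Answer Paris"
    clean = "the weather is nice today in Dublin"
    print(overlap_ratio(item, leaked))
    print(overlap_ratio(item, clean))
```

A surface-level check like this only catches verbatim or near-verbatim overlap. Paraphrased or translated benchmark content would not be flagged, which is one reason a second, comparison-based family of methods is distinguished from it.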