Benchmark Data Contamination of Large Language Models: A Survey

Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi

2024-06-07

Abstract: The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.

Computation and Language

What problem does this paper attempt to address?

This paper addresses Benchmark Data Contamination (BDC) in large language models (LLMs). BDC arises when an LLM unintentionally encounters evaluation benchmark information during training, which distorts evaluation results and makes them unreliable. As a result, the model's capabilities may be overestimated, raising doubts about its reliability in real-world applications.

The paper reviews the challenge of BDC in LLM evaluation and explores alternative evaluation methods that mitigate the risks associated with traditional benchmarks. It also analyzes the open challenges and future directions in mitigating BDC risks, emphasizing the complexity of the problem and the need for innovative solutions.

Structurally, the paper introduces the background of LLMs, defines and discusses the BDC problem, and covers the sources and impacts of BDC. It then describes the problems BDC may pose in different NLP tasks, such as code generation, machine translation, question answering, and sentiment analysis. Finally, it outlines techniques for detecting BDC, categorizing them into matching-based and comparison-based methods. Overall, the survey aims to foster understanding of BDC and to provide insights for addressing it.
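To make the idea of matching-based detection more concrete, the following is a minimal sketch of one simple form it can take: checking how many word n-grams of a benchmark item also appear in a training corpus. This is an illustration only, not the survey's own method. The function names (`ngrams`, `overlap_ratio`), the whitespace tokenization, and the default n-gram size are all assumptions chosen for brevity.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 3) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus.

    Hypothetical helper: a value near 1.0 suggests the item may have leaked
    into the corpus; a value near 0.0 suggests it probably did not.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to match.
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    item = "What is the capital of France"
    leaked = ["Quiz: what is the capital of France ? Paris"]
    clean = ["The weather is nice today"]
    print(overlap_ratio(item, leaked))
    print(overlap_ratio(item, clean))
```

Exact n-gram matching of this kind only catches near-verbatim leakage. Paraphrased or translated benchmark items would pass unnoticed, which is one reason comparison-based methods are treated as a separate category.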

Data Contamination Through the Lens of Time

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Benchmarking Benchmark Leakage in Large Language Models

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

Don't Make Your LLM an Evaluation Benchmark Cheater

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Data Contamination Can Cross Language Barriers

Contamination Report for Multilingual Benchmarks

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

Safety Assessment of Chinese Large Language Models

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

SafetyBench: Evaluating the Safety of Large Language Models

An Open Source Data Contamination Report for Large Language Models

Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions