Abstract: Benchmark contamination refers to the presence of test datasets in Large Language Model (LLM) pre-training or post-training data. Contamination can lead to inflated scores on benchmarks, compromising evaluation results and making it difficult to determine the capabilities of models. In this work, we study the contamination of popular multilingual benchmarks in LLMs that support multiple languages. We use the Black Box test to determine whether $7$ frequently used multilingual benchmarks are contaminated in $7$ popular open and closed LLMs and find that almost all models show signs of being contaminated with almost all the benchmarks we test. Our findings can help the community determine the best set of benchmarks to use for multilingual evaluation.

What problem does this paper attempt to address?

The problem this paper addresses is the contamination of multilingual benchmark datasets in large language models (LLMs). Specifically, the paper examines whether benchmark test datasets appear in the pre-training or fine-tuning data of multilingual LLMs. Such contamination can inflate model scores on benchmark tests and so distort the assessment of model capabilities. The researchers used the black-box test to check whether 7 commonly used multilingual benchmark datasets were contaminated in 7 popular open-source and closed-source LLMs, and found that almost all of the tested models showed signs of contamination to varying degrees. This finding helps the research community determine the best benchmark sets for multilingual evaluation.

### Main contributions of the paper:

1. **Analysis of contamination situation**: Using the contamination detection technique proposed by Oren et al., the paper analyzes which multilingual benchmark datasets are contaminated in which models.
2. **Experimental design**: The paper describes the experimental method in detail, including the models, the datasets, and the specific experimental steps.
3. **Result discussion**: The paper provides detailed experimental results and discusses their impact on multilingual model evaluation.

### Key methods:

- **Black-box testing**: The test relies on the exchangeability of datasets: for a model that has not seen a benchmark, the order of its examples should not matter. The model's performance on the dataset in its original order is compared with its performance on randomly shuffled orders. If the model shows a statistically significant preference for the original order, the dataset is considered contaminated.

### Main findings:

- **Widespread contamination**: Almost all of the tested models showed signs of contamination with multiple benchmark datasets.
- **Newer model versions are more vulnerable to contamination**: Newer model versions are larger and trained on more data, and they are more likely to have included benchmark test datasets.
- **Contamination stage**: Contamination mainly occurs during the pre-training stage and persists after fine-tuning.

### Future work directions:

- **Extended analysis**: Evaluate more datasets and models to better understand how widespread and how severe the contamination problem is.
- **Prevent contamination**: Propose methods and techniques to reduce or prevent dataset contamination in future model training.

### Importance:

- **Evaluation authenticity**: Ensure that evaluations of multilingual model capabilities are authentic and reliable, and avoid inflated scores caused by data contamination.
- **Resource optimization**: Multilingual datasets are scarce and costly to create. Detecting and preventing contamination helps make the best use of them.

By addressing these problems, the researchers hope to provide more reliable and effective benchmark sets for future multilingual model evaluation.
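The order-preference idea behind the black-box test described above can be illustrated with a small permutation test. This is a simplified sketch, not the exact procedure of Oren et al. or of the paper, whose details may differ. The names `permutation_p_value`, `memorizing_scorer`, and `order_blind_scorer`, the number of shuffles, and the toy scorers themselves are all assumptions made for illustration. In a real setting the scorer would return the log-likelihood an LLM assigns to the concatenated examples.

```python
import random
from typing import Callable, List, Sequence

Scorer = Callable[[List[str]], float]


def permutation_p_value(
    examples: Sequence[str],
    log_likelihood: Scorer,
    num_shuffles: int = 99,
    seed: int = 0,
) -> float:
    """Estimate how unusual the canonical order's score is among shuffled orders.

    A small p-value means the scorer prefers the original ordering, which is
    the signal this kind of test treats as evidence of contamination.
    """
    rng = random.Random(seed)
    canonical_score = log_likelihood(list(examples))
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if log_likelihood(shuffled) >= canonical_score:
            at_least_as_high += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_high + 1) / (num_shuffles + 1)


# Hypothetical benchmark with ten examples in their published order.
BENCHMARK = [f"example-{i}" for i in range(10)]
POSITION = {text: i for i, text in enumerate(BENCHMARK)}


def memorizing_scorer(ordering: List[str]) -> float:
    """Toy stand-in for a contaminated model.

    It rewards every adjacent pair that appears in the published order,
    mimicking a model that memorized the benchmark file as a sequence.
    """
    return float(
        sum(
            1
            for left, right in zip(ordering, ordering[1:])
            if POSITION[right] == POSITION[left] + 1
        )
    )


def order_blind_scorer(ordering: List[str]) -> float:
    """Toy stand-in for a clean model: the score ignores example order."""
    return -float(sum(len(text) for text in ordering))


p_contaminated = permutation_p_value(BENCHMARK, memorizing_scorer)
p_clean = permutation_p_value(BENCHMARK, order_blind_scorer)
print(f"memorizing scorer p-value: {p_contaminated:.2f}")
print(f"order-blind scorer p-value: {p_clean:.2f}")
```

With the order-blind scorer every shuffle ties the canonical order, so the p-value is 1 and nothing is flagged. With the memorizing scorer the canonical order scores higher than essentially every shuffle, so the p-value is small. The significance threshold used to call a dataset contaminated is a separate choice and is not specified here.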

Contamination Report for Multilingual Benchmarks

Data Contamination Can Cross Language Barriers

An Open Source Data Contamination Report for Large Language Models

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?

Data Contamination Through the Lens of Time

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Benchmark Data Contamination of Large Language Models: A Survey

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Proving Test Set Contamination in Black Box Language Models

Investigating Data Contamination for Pre-training Language Models

A Taxonomy for Data Contamination in Large Language Models

How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions

PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

CAP: Data Contamination Detection via Consistency Amplification