Contamination Report for Multilingual Benchmarks

Sanchit Ahuja, Varun Gumma, Sunayana Sitaram
2024-10-22
Abstract: Benchmark contamination refers to the presence of test datasets in Large Language Model (LLM) pre-training or post-training data. Contamination can lead to inflated scores on benchmarks, compromising evaluation results and making it difficult to determine the capabilities of models. In this work, we study the contamination of popular multilingual benchmarks in LLMs that support multiple languages. We use the Black Box test to determine whether $7$ frequently used multilingual benchmarks are contaminated in $7$ popular open and closed LLMs and find that almost all models show signs of being contaminated with almost all the benchmarks we test. Our findings can help the community determine the best set of benchmarks to use for multilingual evaluation.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the contamination of multilingual benchmark datasets in large language models (LLMs). Specifically, it examines whether benchmark test sets appear in the pre-training or fine-tuning data of multilingual LLMs. Such contamination can inflate benchmark scores and obscure a model's true capabilities. The researchers used a black-box test to check whether 7 commonly used multilingual benchmark datasets were contaminated in 7 popular open-source and closed-source LLMs, and found that almost all of the tested models showed signs of contamination to varying degrees. This finding helps the research community choose the best benchmark sets for multilingual evaluation.

### Main contributions of the paper:

1. **Analysis of contamination situation**: Using the contamination detection technique proposed by Oren et al., the paper analyzes which multilingual benchmark datasets are contaminated in which models.
2. **Experimental design**: The paper describes the experimental method in detail, including the models, the datasets, and the specific experimental steps.
3. **Result discussion**: The paper provides detailed experimental results and discusses their impact on multilingual model evaluation.

### Key methods:

- **Black-box testing**: The test relies on the exchangeability of datasets: if a model has never seen a dataset, the order of its examples should not matter. The test compares how the model scores the dataset in its original order against randomly shuffled orders. A statistically significant preference for the original order is taken as evidence that the dataset is contaminated.

### Main findings:

- **Widespread contamination**: Almost all of the tested models showed signs of contamination on multiple benchmark datasets.
- **Newer model versions are more vulnerable to contamination**: Newer model versions, which are larger and trained on more data, are more likely to include benchmark test datasets.
- **Contamination stage**: Contamination mainly occurs during the pre-training stage and persists after fine-tuning.

### Future work directions:

- **Extended analysis**: The authors plan to evaluate more datasets and models to better understand how widespread and severe the contamination problem is.
- **Prevent contamination**: The authors intend to propose methods and techniques for reducing or preventing dataset contamination in future model training.

### Importance:

- **Evaluation authenticity**: Reliable evaluation of multilingual models requires avoiding falsely high scores caused by data contamination.
- **Resource optimization**: Multilingual datasets are scarce and costly to create. Detecting and preventing contamination helps make the best use of them.

By addressing these problems, the researchers hope to provide more reliable and effective benchmark sets for future multilingual model evaluation.
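
The black-box test summarized under "Key methods" can be illustrated with a plain permutation test. The sketch below is not the authors' or Oren et al.'s implementation, and the paper may use a different (for example, sharded) variant. The function names, the `sequence_log_prob` callable, and the two toy scorers are hypothetical stand-ins; in practice the scorer would return a real model's log-likelihood of the examples concatenated in the given order.

```python
import random
from typing import Callable, Dict, List, Sequence


def permutation_p_value(
    examples: Sequence[str],
    sequence_log_prob: Callable[[Sequence[str]], float],
    num_shuffles: int = 99,
    seed: int = 0,
) -> float:
    """Estimate how unusual the original ordering's score is among shuffles.

    `sequence_log_prob` is an assumed interface: it takes the examples in
    some order and returns a score (higher means the model prefers it).
    """
    rng = random.Random(seed)
    canonical_score = sequence_log_prob(examples)

    # Count shuffled orderings that the scorer likes at least as much
    # as the original ordering.
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled: List[str] = list(examples)
        rng.shuffle(shuffled)
        if sequence_log_prob(shuffled) >= canonical_score:
            at_least_as_high += 1

    # Standard permutation-test p-value with the +1 correction, so the
    # estimate is never exactly zero.
    return (1 + at_least_as_high) / (1 + num_shuffles)


def order_insensitive_scorer(examples: Sequence[str]) -> float:
    """Toy stand-in for an uncontaminated model: order does not matter."""
    return -float(sum(len(e) for e in examples))


def make_memorized_order_scorer(
    canonical: Sequence[str],
) -> Callable[[Sequence[str]], float]:
    """Toy stand-in for a contaminated model.

    It rewards every adjacent pair that also appears adjacent, in the same
    direction, in the canonical ordering. Assumes the examples are unique.
    """
    position: Dict[str, int] = {ex: i for i, ex in enumerate(canonical)}

    def score(examples: Sequence[str]) -> float:
        pairs = zip(examples, examples[1:])
        return float(sum(1 for a, b in pairs if position[b] == position[a] + 1))

    return score


if __name__ == "__main__":
    dataset = [f"example-{i:02d}" for i in range(20)]
    print("order-insensitive:", permutation_p_value(dataset, order_insensitive_scorer))
    print(
        "memorized order:",
        permutation_p_value(dataset, make_memorized_order_scorer(dataset)),
    )
```

A small p-value for the original ordering suggests the scorer prefers the published order, which is the signal the summary describes as evidence of contamination. A p-value near 1 means the original order is not preferred over shuffles.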