Data Contamination Can Cross Language Barriers

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang
2024-10-31
Abstract: The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices can be "not even wrong", as all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from https://github.com/ShangDataLab/Deep-Contam.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses potential data contamination in the pre-training data of large language models (LLMs). Specifically, it focuses on cross-lingual contamination: a form of contamination in which training on translated versions of benchmark datasets inflates LLM performance while evading existing detection methods. Existing contamination detection methods are usually based on text overlap between training data and evaluation data, which may be too superficial to reflect deeper forms of contamination.

### Main contributions

1. **Identifying cross-lingual contamination**: The paper presents a cross-lingual form of contamination that significantly improves LLM performance on benchmarks but is not caught by existing detection methods.
2. **Redefining the contamination problem**: The paper reframes data contamination in terms of generalization ability and proposes a generalization-based detection method.
3. **Discussing the impact of cross-lingual contamination**: The paper explores how cross-lingual contamination could help in understanding the working mechanisms of LLMs and in improving their multilingual ability through post-training.

### Method overview

#### 3. Injecting cross-lingual contamination

- **Translating the benchmark dataset**: Translate the benchmark test sets into non-English languages.
- **Customizing questions and options**: Fit the translated questions and options into a specific evaluation template to construct a corpus for continual pre-training.
- **Training the model**: Overfit multilingual LLMs on the translated data so that they memorize the translated versions of the benchmark.

#### 4. Detecting cross-lingual contamination

- **Memorization-based methods**:
  - **Shared likelihood**: Detect contamination by statistically comparing the model's log-probabilities on the original dataset with those on a shuffled version.
  - **Guided prompts**: Use carefully designed prompts to guide the model to generate specific text, then use an LLM to judge the similarity between that output and the evaluation data.
  - **N-gram accuracy**: Strip the answer text from a benchmark item and check whether the model's generated output matches the removed text.
- **Generalization-based methods**:
  - **Constructing a generalization benchmark**: Replace the wrong options of each question with the correct options of other questions to create a generalized version of the benchmark.
  - **Measuring contamination**: Use the difference in the same model's performance between the generalized version and the original version as an indicator of potential contamination.

### Experimental results

- **Memorization-based methods**: Existing memorization-based detection methods detect only part of the contamination and are easily fooled by the cross-lingual form.
- **Generalization-based methods**: The proposed generalization-based method detects cross-lingual contamination effectively. Contamination is indicated when the model's performance on the generalized benchmark decreases or hardly improves relative to the original.

### Discussion

- **Utilization of cross-lingual contamination**: Besides being used for cheating, cross-lingual contamination can help interpret the working mechanisms of LLMs and can improve their multilingual ability through post-training.
- **Future work**: Further explore how different language categories affect the impact of cross-lingual contamination, and how to use these findings to improve the multilingual performance of LLMs.
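
The generalization-based check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `Question`, `generalize`, `accuracy`, and `contamination_delta` are hypothetical names, and `predict` stands in for whatever procedure maps a multiple-choice question to the index of the option the model selects. The sketch also assumes the correct option stays in its original position and that distractors are sampled uniformly from other questions' correct answers.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    stem: str
    choices: List[str]
    answer_idx: int  # index of the correct choice


def generalize(questions: List[Question], seed: int = 0) -> List[Question]:
    """Replace every wrong option with the correct option of another question."""
    rng = random.Random(seed)
    generalized = []
    for i, q in enumerate(questions):
        correct = q.choices[q.answer_idx]
        # Candidate distractors: correct answers of *other* questions,
        # deduplicated and never equal to this question's own answer.
        pool = list(dict.fromkeys(
            p.choices[p.answer_idx]
            for j, p in enumerate(questions)
            if j != i and p.choices[p.answer_idx] != correct
        ))
        n_wrong = len(q.choices) - 1
        if len(pool) < n_wrong:
            raise ValueError("not enough distinct answers to build distractors")
        new_choices = rng.sample(pool, n_wrong)
        # Assumption: keep the correct option at its original position.
        new_choices.insert(q.answer_idx, correct)
        generalized.append(Question(q.stem, new_choices, q.answer_idx))
    return generalized


def accuracy(predict: Callable[[Question], int],
             questions: List[Question]) -> float:
    """Fraction of questions where the predicted index is the correct one."""
    hits = sum(predict(q) == q.answer_idx for q in questions)
    return hits / len(questions)


def contamination_delta(predict: Callable[[Question], int],
                        questions: List[Question],
                        seed: int = 0) -> float:
    """Accuracy on the generalized benchmark minus accuracy on the original."""
    return accuracy(predict, generalize(questions, seed)) - accuracy(predict, questions)
```

Under the reading given in the summary, a delta that is negative or close to zero is the signal associated with contamination, whereas a model that has not memorized the benchmark would be expected to gain from the easier distractors. Any decision threshold on the delta is left open here, since the summary does not state one.
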
### Conclusion

By identifying and detecting cross-lingual contamination, the paper provides a new perspective for understanding and preventing data contamination in LLM training. The proposed generalization-based method offers a practical tool for future LLM evaluation and improvement.