Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

Qin Zhu,Qingyuan Cheng,Runyu Peng,Xiaonan Li,Tengxiao Liu,Ru Peng,Xipeng Qiu,Xuanjing Huang
2024-06-24
Abstract: The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs achieve increasingly strong results on various benchmarks, their performance in practical applications does not always match those results. Benchmark leakage can prevent an accurate assessment of an LLM's true performance. However, constructing new benchmarks is costly, labor-intensive, and still carries the risk of leakage. In this paper we therefore ask: can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD), which addresses this issue by detecting and rewriting leaked samples without altering their difficulty. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, applying ITD lowers the results of Phi3 and Mistral by 6.7% and 3.6% respectively. We hope that ITD can provide more truthful evaluation results for large language models.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses performance inflation in the evaluation of large language models (LLMs) caused by data leakage. Specifically:

1. **Data leakage problem**: Current large language models may inadvertently be exposed to test data during training. This leakage can make a model's benchmark results overly optimistic, so that they no longer reflect its performance in practical applications.
2. **Reuse of benchmarks**: Building new benchmarks is time-consuming and labor-intensive, and new benchmarks still carry a risk of leakage. How to reuse already-leaked benchmarks to evaluate large language models is therefore an open problem.

### Solutions

The paper proposes **Inference-Time Decontamination (ITD)**, which alleviates performance inflation caused by memorizing leaked benchmark data by detecting and rewriting leaked samples without changing their difficulty. The steps are as follows:

1. **Detection**: Use a detector to identify potentially leaked samples. The detector judges whether a text appears in the pre-training data using Min-K% Prob, a method that scores a text by the average log-probability of its least likely tokens.
2. **Rewrite**: Rewrite the detected samples so that their wording changes while their core content and difficulty stay the same. For knowledge-based tasks (such as MMLU), keep the knowledge points unchanged and rephrase the question; for mathematical reasoning tasks (such as GSM8K), keep the calculation logic and answer structure unchanged and change the background setting.
3. **Verification**: Run the detector again on the rewritten samples to check that they are no longer recognized as memorized. If a sample is still detected as leaked, rewrite it again until it passes detection or the maximum number of rewrites is reached.
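
The detect, rewrite, and verify steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `min_k_prob` and `decontaminate`, the parameters `k`, `threshold`, and `max_rewrites`, and their default values are hypothetical. The detector follows the commonly described form of Min-K% Prob (mean log-probability of the lowest-scoring fraction of tokens), and `rewrite_fn` stands in for whatever LLM-based rewriter is used, which this summary does not specify.

```python
from typing import Callable, Sequence, Tuple


def min_k_prob(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of least likely tokens.

    A higher (less negative) score suggests the text is more likely to
    have been seen during training. `k` is an illustrative default.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def decontaminate(
    sample: str,
    score_fn: Callable[[str], float],
    rewrite_fn: Callable[[str], str],
    threshold: float,
    max_rewrites: int = 3,
) -> Tuple[str, int, bool]:
    """Rewrite `sample` until the detector no longer flags it.

    score_fn:   hypothetical detector, e.g. Min-K% Prob computed from the
                evaluated model's per-token log-probabilities.
    rewrite_fn: hypothetical rewriter that preserves content and difficulty.
    Returns (final_text, rewrites_used, still_flagged).
    """
    text = sample
    rewrites = 0
    # Detection and verification use the same check: score >= threshold.
    while score_fn(text) >= threshold and rewrites < max_rewrites:
        text = rewrite_fn(text)
        rewrites += 1
    still_flagged = score_fn(text) >= threshold
    return text, rewrites, still_flagged
```

In practice `score_fn` would wrap a call that obtains per-token log-probabilities from the model under evaluation and passes them to `min_k_prob`; the threshold would have to be calibrated on data with known contamination status.
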
### Experimental results

- **Proof-of-concept experiment**: Experiments were carried out on artificially leaked data. After applying ITD, the inflated accuracy on GSM8K and MMLU decreased by 22.9% and 19.0% respectively, demonstrating the effectiveness of ITD.
- **Actual model experiment**: Experiments were carried out on two models, Mistral-7B and Phi-3-mini-128k-instruct. After applying ITD, the performance of these models on GSM8K and MMLU also decreased; on MMLU, the results of Phi3 and Mistral dropped by 6.7% and 3.6% respectively.

### Core contributions

1. Proposed the Inference-Time Decontamination (ITD) method to alleviate the inflation of evaluation results caused by data leakage.
2. Showed through a proof-of-concept experiment that ITD can effectively reduce the inflated performance a model obtains by memorizing benchmark data.
3. Ran experiments on two commonly used large language models and found that their performance on MMLU and GSM8K decreased to varying degrees.
4. Released the rewritten GSM8K and MMLU datasets to support future evaluation work.

### Conclusion

Inference-Time Decontamination (ITD) makes it possible to reuse leaked benchmarks to evaluate large language models more realistically, without building new benchmarks. This saves resources and improves the fairness and reliability of evaluation.