Abstract: The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs achieve increasingly strong results on various benchmarks, their performance in practical applications does not always match those results. Benchmark leakage can prevent an accurate assessment of LLMs' true performance. However, constructing new benchmarks is costly, labor-intensive and still carries the risk of leakage. In this paper we therefore ask: can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulty. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, applying ITD lowers the results of Phi3 and Mistral by 6.7% and 3.6% respectively. We hope that ITD can provide more truthful evaluation results for large language models.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses performance inflation in the evaluation of large language models (LLMs) caused by data leakage. Specifically:

1. **Data leakage problem**: LLMs may inadvertently be exposed to test data during training. This leakage can make benchmark results overly optimistic, so that they no longer reflect performance in practical applications.
2. **Reuse of benchmarks**: Building new benchmarks is time-consuming and labor-intensive, and new benchmarks still carry a risk of leakage. How to reuse already-leaked benchmarks to evaluate LLMs is therefore an urgent open problem.

### Solutions

The paper proposes **Inference-Time Decontamination (ITD)**, which mitigates performance inflation caused by memorizing leaked benchmark data by detecting and rewriting leaked samples without changing their difficulty. The steps are as follows:

1. **Detection**: Use a detector to identify potentially leaked samples. The detector judges whether a text appeared in the pre-training data using Min-K% Prob (MinKProb), which scores a text by the average log-probability of its least likely tokens.
2. **Rewrite**: Rewrite the detected leaked samples so that their wording changes but their core content and difficulty do not. For knowledge-based tasks (such as MMLU), keep the knowledge points unchanged and rephrase the questions; for mathematical reasoning tasks (such as GSM8K), keep the calculation logic and answer structure unchanged and change the background setting.
3. **Verification**: Re-run detection on the rewritten samples to check that the model no longer recognizes them as memorized. If a sample is still flagged as leaked, rewrite it again until it passes detection or the maximum number of rewrites is reached.
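
The detection, rewriting and verification steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `logprob_fn` (returns per-token log-probabilities of a text under the evaluated model) and `rewrite_fn` (returns a paraphrase, for example from another LLM) are hypothetical callables, and the values of `threshold`, `k` and `max_rewrites` are placeholders that are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple


def min_k_prob(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of least likely tokens.

    A higher (less negative) score suggests the model finds even the
    "hard" tokens unsurprising, which hints at memorization.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the n smallest log-probabilities
    return sum(lowest) / n


def is_leaked(token_logprobs: Sequence[float],
              threshold: float = -2.0, k: float = 0.2) -> bool:
    # Assumption: a sample counts as leaked when its score exceeds a
    # tuned threshold; -2.0 is only a placeholder value.
    return min_k_prob(token_logprobs, k) > threshold


def decontaminate(sample: str,
                  logprob_fn: Callable[[str], Sequence[float]],
                  rewrite_fn: Callable[[str], str],
                  threshold: float = -2.0,
                  k: float = 0.2,
                  max_rewrites: int = 3) -> Tuple[str, int]:
    """Rewrite a sample until it passes detection or the budget runs out.

    Returns the final text and the number of rewrites performed.
    """
    current = sample
    rewrites = 0
    while (is_leaked(logprob_fn(current), threshold, k)
           and rewrites < max_rewrites):
        current = rewrite_fn(current)  # step 2: rewrite
        rewrites += 1                  # step 3: the loop condition re-detects
    return current, rewrites
```

A sample that is never flagged is returned unchanged with zero rewrites, so clean benchmark items are left as they are. Note that this sketch does not check that a rewrite preserves difficulty; that constraint would have to be enforced inside `rewrite_fn`.
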
### Experimental results

- **Proof-of-concept experiment**: Experiments were carried out on artificially leaked data. After applying ITD, the inflated accuracy on GSM8K and MMLU dropped by 22.9% and 19.0% respectively, which supports the effectiveness of ITD.
- **Experiments on existing models**: Experiments were carried out on two models, Mistral-7b and Phi-3-mini-128k-instruct. After applying ITD, the performance of both models on GSM8K and MMLU also decreased; on MMLU, the abstract reports drops of 6.7% for Phi3 and 3.6% for Mistral.

### Core contributions

1. Proposes the Inference-Time Decontamination (ITD) method to mitigate the inflation of evaluation results caused by data leakage.
2. Shows through a proof-of-concept experiment that ITD can reduce the bias in measured performance that comes from memorizing benchmark data.
3. Evaluates two commonly used large language models and finds that their performance on MMLU and GSM8K decreases to varying degrees.
4. Releases the rewritten GSM8K and MMLU datasets to support future evaluation work.

### Conclusion

ITD makes it possible to reuse leaked benchmarks to evaluate large language models more realistically without building new benchmarks. This saves resources and improves the fairness and reliability of evaluation.

Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

Don't Make Your LLM an Evaluation Benchmark Cheater

Benchmarking Benchmark Leakage in Large Language Models

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Training on the Benchmark Is Not All You Need

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge

Data Contamination Through the Lens of Time

Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Are Large Language Models Memorizing Bug Benchmarks?

Benchmark Data Contamination of Large Language Models: A Survey

CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models

How much can we forget about Data Contamination?

tinyBenchmarks: evaluating LLMs with fewer examples

DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation

Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

Data Contamination Can Cross Language Barriers