Abstract: Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field's healthy development. To address this, we introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark, to identify potential data leakage. By analyzing 31 LLMs in the context of mathematical reasoning, we reveal substantial instances of training set and even test set misuse, resulting in potentially unfair comparisons. These findings prompt us to offer several recommendations regarding model documentation, benchmark setup, and future evaluations. Notably, we propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization, promoting transparency and the healthy development of LLMs. We have made our leaderboard, pipeline implementation, and model predictions publicly available, fostering future research.

What problem does this paper attempt to address?

This paper addresses benchmark dataset leakage in large language models (LLMs). As the use of pre-training data expands, leakage has become increasingly serious, particularly when the training process lacks transparency and supervised data is included without disclosure. Leakage distorts what a benchmark measures and leads to unfair comparisons, which hinders the healthy development of the field.

#### Main problems:

1. **Benchmark dataset leakage**: Many LLMs may inadvertently train on data from benchmark datasets, so their benchmark scores overstate their actual ability.
2. **Unfair comparison**: Some models may have trained on benchmark data while others have not, so their scores are not directly comparable.
3. **Lack of transparency**: Opaque training data and model details make it difficult to evaluate a model's true capabilities objectively and fairly.

To address these problems, the authors propose a detection pipeline based on perplexity and N-gram accuracy to identify potential data leakage. By analyzing 31 LLMs on mathematical reasoning tasks, they find that many models may have misused training sets or even test sets, resulting in unfair comparisons.

#### Key points of the solution:

1. **Introducing a detection pipeline**: Two simple and scalable metrics, perplexity and N-gram accuracy, measure how precisely a model predicts the benchmark data and thereby flag potential leakage.
2. **Instance-level detection**: The pipeline detects leakage not only at the level of a whole dataset but also for individual instances, which makes detection more precise.
3. **Transparency suggestions**: To promote transparency and healthy development, the authors propose a "Benchmark Transparency Card". Models would be released with detailed documentation that records whether benchmark data was used to improve specific performance and whether data augmentation was applied.

Through these measures, the authors hope to make the development and evaluation of language models more transparent and fair, and to keep research in the field scientifically sound.

### Formula summary

1. **Perplexity**:

   \[ PPL(X)=\exp\left(-\frac{1}{t}\sum_{i=0}^{t}\log p_{\theta}(x_{i}\mid x_{<i})\right) \]

   where \(X=[x_{0},x_{1},\ldots,x_{t}]\) is the tokenized sequence.

2. **N-gram accuracy**:

   \[ N\text{-gram Accuracy}(X)=\frac{1}{S}\sum_{i=0}^{S}\sum_{j=0}^{K}I\left(X_{start_{j}:start_{j}+n},\hat{X}_{start_{j}:start_{j}+n}\right) \]

   where \(S\) is the dataset size, \(K\) is the number of starting points per sample, \(start_{j}\) is the index of the \(j\)-th starting point, \(X_{start_{j}:start_{j}+n}\) is the gold n-gram to be predicted, \(\hat{X}_{start_{j}:start_{j}+n}\) is the n-gram generated by the model, and \(I\) is the indicator function, which uses exact match by default.

Together, these metrics and recommendations give the authors a toolset for evaluating model behavior and detecting potential training data leakage.
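The two metrics can be illustrated with a minimal Python sketch. This is not the authors' released pipeline: the function names, the `generate` callable, the evenly spaced choice of starting points, and the toy "memorizing" model are all assumptions made for illustration. The sketch also averages over the number of scored tokens (perplexity) and over the number of n-gram comparisons (accuracy), a normalization chosen here so that the accuracy falls in [0, 1].

```python
import math
from typing import Callable, List, Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean log-probability of the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def start_points(length: int, n: int, k: int) -> List[int]:
    """Up to k evenly spaced start indices (assumed scheme).

    Each index leaves at least one prefix token before it and n tokens after it.
    """
    first, last = 1, length - n
    if last < first or k < 1:
        return []
    if k == 1:
        return [first]
    # Integer spacing can produce duplicates on short sequences, so dedupe.
    return sorted({first + (last - first) * j // (k - 1) for j in range(k)})

def ngram_accuracy(
    samples: Sequence[Sequence[int]],
    generate: Callable[[Sequence[int], int], Sequence[int]],
    n: int = 5,
    k: int = 5,
) -> float:
    """Fraction of start points where the generated n-gram exactly matches the gold one."""
    hits = total = 0
    for tokens in samples:
        for s in start_points(len(tokens), n, k):
            predicted = generate(tokens[:s], n)  # model continues the prefix
            hits += int(list(predicted) == list(tokens[s:s + n]))
            total += 1
    return hits / total if total else 0.0

# Hypothetical toy model that has "memorized" one benchmark sample.
MEMORIZED = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

def toy_generate(prefix: Sequence[int], n: int) -> List[int]:
    """Continue the memorized sample if the prefix matches it, else emit zeros."""
    if list(prefix) == MEMORIZED[:len(prefix)]:
        return MEMORIZED[len(prefix):len(prefix) + n]
    return [0] * n

if __name__ == "__main__":
    # A leaked sample is reproduced exactly; an unseen one is not.
    print(ngram_accuracy([MEMORIZED], toy_generate, n=3, k=4))
    print(ngram_accuracy([[7] * 8], toy_generate, n=3, k=4))
    # Four tokens, each predicted with probability 0.5: perplexity is about 2.
    print(perplexity([math.log(0.5)] * 4))
```

In a real setting, `generate` would wrap greedy decoding from the model under test and `token_logprobs` would come from its scoring pass. A markedly higher N-gram accuracy, or a markedly lower perplexity, on a benchmark split than on comparable unseen text is the kind of signal the pipeline treats as a hint of leakage.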

Benchmarking Benchmark Leakage in Large Language Models
