Abstract: The problem of pre-training data detection for large language models (LLMs) has received growing attention due to its implications in critical issues like copyright violation and test data contamination. Despite improved performance, existing methods (including the state-of-the-art, Min-K%) are mostly developed upon simple heuristics and lack solid, reasonable foundations. In this work, we propose a novel and theoretically motivated methodology for pre-training data detection, named Min-K%++. Specifically, we present a key insight that training samples tend to be local maxima of the modeled distribution along each input dimension through maximum likelihood training, which in turn allows us to translate the problem into identification of local maxima. Then, we design our method accordingly that works under the discrete distribution modeled by LLMs, whose core idea is to determine whether the input forms a mode or has relatively high probability under the conditional categorical distribution. Empirically, the proposed method achieves new SOTA performance across multiple settings. On the WikiMIA benchmark, Min-K%++ outperforms the runner-up by 6.2% to 10.5% in detection AUROC averaged over five models. On the more challenging MIMIR benchmark, it consistently improves upon reference-free methods while performing on par with the reference-based method that requires an extra reference model.

### What problem does this paper attempt to solve?

This paper addresses pre-training data detection for large language models (LLMs): determining whether a given input text was used to train a particular model. The problem matters because it bears on copyright infringement and test data contamination.

#### Background and Motivation

1. **Importance of the problem**:
   - **Copyright issues**: If the pre-training data contains copyrighted content (such as books or news articles), training on it may infringe on the rights of content creators.
   - **Data leakage risks**: The pre-training data may contain private information that could be extracted and misused.
   - **Validity of evaluation benchmarks**: If the model saw the evaluation data during training, the validity and reliability of the evaluation results come into question.
2. **Limitations of existing methods**:
   - Most existing methods (including the state-of-the-art Min-K%) are based on simple heuristics and lack a solid theoretical foundation.
   - These methods perform poorly on large-scale pre-training corpora, especially when the data distribution is complex.

#### Proposed New Method

To address these problems, the authors propose a new method, **Min-K%++**. Its main contributions are as follows:

1. **Theoretical foundation**:
   - By re-examining the maximum likelihood training objective, the authors find that training samples tend to be local maxima of the modeled distribution along each input dimension.
   - This insight lets them recast pre-training data detection as the problem of identifying local maxima.
2. **Method design**:
   - Min-K%++ decides whether the input is training data by checking whether it forms a mode, or has relatively high probability, under the conditional categorical distribution modeled by the LLM.
   - Specifically, the method computes the log-probability of each input token under the conditional categorical distribution and compares it with the expected log-probability under that distribution.
3. **Performance improvement**:
   - On the WikiMIA benchmark, Min-K%++ outperforms the runner-up by 6.2% to 10.5% in detection AUROC, averaged over five models.
   - On the more challenging MIMIR benchmark, Min-K%++ consistently improves upon reference-free methods and performs on par with the reference-based method, which requires an extra reference model.

#### Experimental Verification

To verify the effectiveness of Min-K%++:

- **Benchmark datasets**: The WikiMIA and MIMIR benchmarks are used.
- **Model selection**: The experiments cover a variety of large language models, such as Pythia, LLaMA, OPT, and Mamba.
- **Evaluation metrics**: AUROC (Area Under the Receiver Operating Characteristic Curve) is the main metric, reported across different input lengths and different models.

In summary, Min-K%++ is a theoretically motivated and empirically effective method for pre-training data detection that improves on the state of the art.
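The comparison described under "Method design" can be illustrated with a minimal Python sketch. It is not the authors' implementation. The summary states only that the token's log-probability is compared with the expected log-probability, so the division by the standard deviation and the averaging over the lowest-scoring fraction `k` of tokens (suggested by the name Min-K%) are assumptions of this sketch. The function names `token_score` and `sequence_score` are hypothetical, and the per-step logits are assumed to come from the LLM being audited.

```python
import math


def log_softmax(logits):
    """Turn raw logits into log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def token_score(logits, token_id):
    """Compare the observed token's log-probability with the expected
    log-probability under the same next-token distribution.

    Assumption: the gap is scaled by the standard deviation of the
    log-probabilities; the summary above only mentions the comparison.
    """
    log_probs = log_softmax(logits)
    probs = [math.exp(lp) for lp in log_probs]
    # Expected log-probability over the vocabulary.
    mu = sum(p * lp for p, lp in zip(probs, log_probs))
    # Standard deviation of the log-probability over the vocabulary.
    var = sum(p * (lp - mu) ** 2 for p, lp in zip(probs, log_probs))
    sigma = math.sqrt(var)
    return (log_probs[token_id] - mu) / (sigma + 1e-8)


def sequence_score(per_step_logits, token_ids, k=0.2):
    """Average the lowest k fraction of token scores (assumed aggregation).

    A higher score suggests the text is more likely to be training data.
    """
    scores = sorted(
        token_score(logits, tok)
        for logits, tok in zip(per_step_logits, token_ids)
    )
    n = max(1, int(len(scores) * k))
    return sum(scores[:n]) / n


if __name__ == "__main__":
    # Toy 3-word vocabulary where token 0 is the mode at every step.
    steps = [[5.0, 0.0, 0.0]] * 5
    print(sequence_score(steps, [0] * 5))  # tokens at the mode
    print(sequence_score(steps, [1] * 5))  # low-probability tokens
```

In the toy example, a sequence whose tokens sit at the mode of each next-token distribution scores above zero, while a sequence of low-probability tokens scores below zero. Choosing a decision threshold on this score is a separate step; in the experiments summarized above, detection quality is measured with AUROC.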

Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models

Detecting Pretraining Data from Large Language Models

Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method

Fine-tuning can Help Detect Pretraining Data from Large Language Models

Adaptive Pre-training Data Detection for Large Language Models via Surprising Tokens

How to Train Data-Efficient LLMs

Probing Language Models for Pre-training Data Detection

Data Proportion Detection for Optimized Data Management for Large Language Models

MiniPLM: Knowledge Distillation for Pre-Training Language Models

MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector

Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Nanolm: an Affordable LLM Pre-training Benchmark Via Accurate Loss Prediction Across Scales

Training on the Benchmark Is Not All You Need

Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models

Improving Logits-based Detector without Logits from Black-box LLMs

MLLM-DataEngine: An Iterative Refinement Approach for MLLM

The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis

YuLan-Mini: An Open Data-efficient Language Model