Data Contamination Calibration for Black-box LLMs

Wentao Ye, Jiaqi Hu, Liyao Li, Haobo Wang, Gang Chen, Junbo Zhao
2024-06-03
Abstract: The rapid advancement of Large Language Models (LLMs) is tightly associated with the expansion of training data size. However, unchecked ultra-large-scale training sets introduce a series of potential risks such as data contamination, i.e., benchmark data being used for training. In this work, we propose a holistic method named Polarized Augment Calibration (PAC), along with a new to-be-released dataset, to detect contaminated data and diminish the contamination effect. PAC extends the popular MIA (Membership Inference Attack) -- from the machine learning community -- by forming a more global target for detecting training data to clarify invisible training data. As a pioneering work, PAC is plug-and-play and can be integrated with most (if not all) current white- and black-box LLMs. In extensive experiments, PAC outperforms existing methods by at least 4.5% on data contamination detection across more than 4 dataset formats and more than 10 base LLMs. In addition, our application in real-world scenarios highlights the prominent presence of contamination and related issues.
Machine Learning
What problem does this paper attempt to address?
This paper addresses data contamination in the training data of large language models (LLMs). As training sets keep growing, benchmark test data may be inadvertently included in them, which produces misleading evaluation results and undermines conclusions about a model's effectiveness and safety. Such contamination can also raise legal, privacy, and bias problems. The paper proposes a method named Polarized Augment Calibration (PAC) to detect contaminated data and reduce its impact. PAC improves training-data detection by generating adjacent samples and computing a polarization distance, and it is applicable to most current white-box and black-box LLMs. In extensive experiments across multiple dataset formats and base LLMs, PAC improves data contamination detection performance by at least 4.5% over existing methods.
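The summary above names two ingredients, adjacent samples and a polarization distance, without giving formulas. The following Python sketch is one possible reading, not the paper's implementation. It assumes that the polarization distance is the gap between the mean of the highest and the mean of the lowest token log-probabilities, that adjacent samples are produced by random token swaps, and that the final score is the original distance minus the average distance over adjacent samples. The names `polarized_distance`, `swap_augment`, `calibrated_score`, and the `token_logprobs` callback are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def polarized_distance(logprobs: Sequence[float], k_frac: float = 0.2) -> float:
    """Assumed definition: mean of the top-k token log-probs minus mean of the bottom-k."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    k = max(1, int(len(logprobs) * k_frac))
    ordered = sorted(logprobs)
    low, high = ordered[:k], ordered[-k:]
    return sum(high) / k - sum(low) / k


def swap_augment(tokens: Sequence[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Assumed augmentation: build an adjacent sample by swapping random token pairs."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def calibrated_score(
    tokens: Sequence[str],
    token_logprobs: Callable[[Sequence[str]], List[float]],  # hypothetical model wrapper
    n_neighbors: int = 5,
    n_swaps: int = 2,
    k_frac: float = 0.2,
    seed: int = 0,
) -> float:
    """Original polarized distance minus the mean distance over adjacent samples."""
    rng = random.Random(seed)
    original = polarized_distance(token_logprobs(tokens), k_frac)
    neighbors = [
        polarized_distance(token_logprobs(swap_augment(tokens, n_swaps, rng)), k_frac)
        for _ in range(n_neighbors)
    ]
    return original - sum(neighbors) / len(neighbors)
```

In use, `token_logprobs` would wrap whatever per-token log-probabilities the target model exposes, and a decision threshold on the score would need to be tuned on samples with known membership. The sign convention and the default parameters here are illustrative assumptions only.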