Blind Baselines Beat Membership Inference Attacks for Foundation Models

Debeshee Das, Jie Zhang, Florian Tramèr
2024-06-24
Abstract: Membership inference (MI) attacks try to determine if a data sample was used to train a machine learning model. For foundation models trained on unknown Web data, MI attacks can be used to detect copyrighted training materials, measure test set contamination, or audit machine unlearning. Unfortunately, we find that evaluations of MI attacks for foundation models are flawed, because they sample members and non-members from different distributions. For 8 published MI evaluation datasets, we show that blind attacks -- that distinguish the member and non-member distributions without looking at any trained model -- outperform state-of-the-art MI attacks. Existing evaluations thus tell us nothing about membership leakage of a foundation model's training data.
Cryptography and Security, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses flaws in how membership inference (MI) attacks against foundation models are currently evaluated. Specifically, the authors point out:

1. **Members and non-members come from different distributions**: In many existing MI evaluations, member and non-member data are sampled from different distributions. For example, in some datasets the members were collected before a specific cutoff date and the non-members after it. This temporal shift lets a blind attack separate members from non-members without ever querying the trained model.
2. **Existing evaluations are unreliable**: Because of this shift, existing MI evaluations cannot show whether a model actually leaks information about its training data. Using simple blind attacks (such as date-based detection and bag-of-words classifiers), the authors show that these baselines outperform state-of-the-art MI attacks on multiple publicly available MI evaluation datasets. This suggests that existing MI attacks may not be extracting membership information from the model at all, and may instead be exploiting differences in the data distributions.
3. **Proposed improvement directions**: To evaluate MI attacks more reliably, the authors suggest using datasets with a clear train/test split, such as The Pile or DataComp. With such a split, members and non-members come from the same distribution, which avoids the distribution shift problem.

### Main contributions

- **Reveal the flaws of existing evaluation methods**: By analyzing 8 publicly available MI evaluation datasets, the authors expose the distribution shift in these datasets and show that blind attacks outperform published MI attacks on them.
- **Provide improvement solutions**: Future research should evaluate MI attacks on datasets whose training and test sets are randomly split, so that the evaluation results are reliable.

### Formula representation

The evaluation metrics involved in the paper include:

- **TPR@FPR (True Positive Rate at False Positive Rate)**: The true positive rate measured at the decision threshold that yields a given (typically low) false positive rate $\alpha$.

\[ \text{TPR@FPR}_{\alpha}=\frac{\text{TP}(\tau_\alpha)}{\text{TP}(\tau_\alpha)+\text{FN}(\tau_\alpha)}, \quad \text{where the threshold } \tau_\alpha \text{ satisfies } \text{FPR}(\tau_\alpha)=\alpha \]

- **AUC ROC (Area Under the Receiver Operating Characteristic Curve)**: The area under the ROC curve, which measures the overall performance of a classifier across all thresholds.

\[ \text{AUC ROC}=\int_{0}^{1}\text{TPR}(\text{FPR})\,d(\text{FPR}) \]

These metrics are used to measure the effectiveness of MI attacks and to compare them against blind attacks.

### Summary

This paper reveals serious flaws in current MI attack evaluations and proposes a more reliable way to evaluate them. By using datasets with a clear train/test split, future MI evaluations can better reflect how much membership information a model actually leaks.
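
To make the metrics and the idea of a blind attack concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`blind_date_score`, `roc_points`, `auc_roc`, `tpr_at_fpr`), the `default_year` fallback, and the six toy samples are all hypothetical and chosen only for illustration. The sketch scores each text by the latest year it mentions, without consulting any model, and then computes AUC ROC and TPR at a fixed FPR from those scores.

```python
import re
from typing import List, Tuple


def blind_date_score(text: str, default_year: int = 2020) -> float:
    """Blind membership score: texts mentioning only older years score higher.

    No trained model is queried. `default_year` is a hypothetical fallback
    for texts that mention no year at all.
    """
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    latest = max(years) if years else default_year
    return -float(latest)


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, sweeping the threshold from high to low scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one member and one non-member")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Samples with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_roc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points: List[Tuple[float, float]], max_fpr: float) -> float:
    """Best TPR achievable at any threshold whose FPR is at most `max_fpr`."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Toy data (invented for illustration): label 1 = member, 0 = non-member.
samples = [
    ("The festival was first held in 2015.", 1),
    ("Elections took place in 2019 and 2021.", 1),
    ("The bridge opened in 2022.", 1),
    ("The 2024 season ended in March.", 0),
    ("Released in 2023 after a delay from 2021.", 0),
    ("The company was founded in 2020.", 0),
]
scores = [blind_date_score(text) for text, _ in samples]
labels = [label for _, label in samples]

points = roc_points(scores, labels)
auc = auc_roc(points)
print(f"AUC ROC: {auc:.3f}")
print(f"TPR at FPR <= 0: {tpr_at_fpr(points, 0.0):.3f}")
```

On this toy data the blind score separates most members from non-members purely because the two groups mention different years. That is the kind of distribution shift the paper identifies; on a randomly split dataset, a score like this should perform close to chance.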