Abstract: The rise of deep learning (DL) has led to a surging demand for training data, which incentivizes the creators of DL models to trawl through the Internet for training materials. Meanwhile, users often have limited control over whether their data (e.g., facial images) are used to train DL models without their consent, which has engendered pressing concerns. This work proposes MembershipTracker, a practical data provenance tool that can empower ordinary users to take agency in detecting the unauthorized use of their data in training DL models. We view tracing data provenance through the lens of membership inference (MI). MembershipTracker consists of a lightweight data marking component to mark the target data with small and targeted changes, which can be strongly memorized by the model trained on them; and a specialized MI-based verification process to audit whether the model exhibits strong memorization on the target samples. Overall, MembershipTracker only requires the users to mark a small fraction of data (0.005% to 0.1% in proportion to the training set), and it enables the users to reliably detect the unauthorized use of their data (average 0% FPR@100% TPR). We show that MembershipTracker is highly effective across various settings, including industry-scale training on the full-size ImageNet-1k dataset. We finally evaluate MembershipTracker under multiple classes of countermeasures.

What problem does this paper attempt to address?

This paper addresses the unauthorized use of personal data to train deep-learning models. Specifically, it asks how a user can detect whether their data was used for training without their consent. As deep learning's demand for training data grows, data collectors increasingly scrape the Internet for material. This practice lacks transparency, and users have little control over whether their data is used, which raises serious privacy concerns. The paper proposes **MembershipTracker**, a tool that traces data provenance through membership inference (MI) so that ordinary users can detect whether their data was used without authorization to train a deep-learning model.

### Core contributions of the paper:

1. **Lightweight data-marking technique**: Small, targeted modifications are applied to the target data. The model trained on the data strongly memorizes these modifications, which makes membership inference more effective.
2. **Set-based, high-power membership-inference verification process**: A new set-based verification process reliably audits whether the target data was used to train the model. It does not rely on additional shadow models or reference models, and it achieves a low false-positive rate (FPR) at a high true-positive rate (TPR).
3. **Practical data-provenance tool**: The marking technique and the verification process are integrated into a single tool, **MembershipTracker**, which is evaluated under multiple settings (including six benchmark datasets and six deep-learning architectures). Experimental results show that even when only a very small proportion of the samples in the dataset (0.005% to 0.1%) are marked, **MembershipTracker** can trace data provenance effectively, with an average FPR of 0% at 100% TPR.

### Technical details:

- **Data marking**: Data is marked in two steps: image fusion and noise injection. First, the original sample is fused with out-of-distribution (OOD) features:
\[ x \oplus (x_{\text{ood}}, m)=m \cdot x+(1 - m) \cdot x_{\text{ood}} \]
where \(m\) controls the relative contribution of the original and OOD features. Then, procedural noise is injected to further strengthen the model's memorization of these samples.
- **Membership-inference verification**: A set-based membership-inference process compares the prediction losses of the target user's marked samples with those of non-member samples from non-target users, and from this determines whether the model has strongly memorized the target samples.

### Evaluation metrics:

- **True-positive rate (TPR)** and **false-positive rate (FPR)**: To evaluate the success of membership inference, the paper uses two metrics: TPR at a fixed FPR and FPR at a fixed TPR. The FPR at 100% TPR is chosen as the main metric, so that every instance of unauthorized data use is detected while false alarms are avoided.

Through these contributions, **MembershipTracker** gives ordinary users a practical tool for gaining more control over their data privacy.
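The fusion formula and the FPR-at-100%-TPR metric described under "Technical details" and "Evaluation metrics" can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the default values of `m` and `eps`, and the use of bounded uniform noise as a stand-in for procedural noise are all hypothetical choices. The metric function works on generic scores (for example, one aggregated loss per user) and does not reproduce the paper's set-based verification procedure.

```python
import numpy as np


def blend_with_ood(x, x_ood, m):
    """Fuse an image with an OOD image: m * x + (1 - m) * x_ood."""
    if x.shape != x_ood.shape:
        raise ValueError("x and x_ood must have the same shape")
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie in [0, 1]")
    return m * x + (1.0 - m) * x_ood


def add_bounded_noise(x, eps, rng):
    """ASSUMPTION: uniform noise in [-eps, eps] stands in for procedural noise."""
    noise = rng.uniform(-eps, eps, size=x.shape)
    # Keep pixel values in the valid [0, 1] range after perturbation.
    return np.clip(x + noise, 0.0, 1.0)


def mark_image(x, x_ood, m=0.7, eps=8 / 255, seed=0):
    """Two-step marking: fusion, then noise injection (defaults are hypothetical)."""
    rng = np.random.default_rng(seed)
    return add_bounded_noise(blend_with_ood(x, x_ood, m), eps, rng)


def fpr_at_full_tpr(member_losses, nonmember_losses):
    """FPR when the loss threshold is loose enough to flag every member.

    Lower loss is treated as more member-like, so the threshold is the
    largest member loss; any non-member at or below it is a false positive.
    """
    member_losses = np.asarray(member_losses, dtype=float)
    nonmember_losses = np.asarray(nonmember_losses, dtype=float)
    threshold = member_losses.max()
    return float(np.mean(nonmember_losses <= threshold))


if __name__ == "__main__":
    x = np.full((2, 2, 3), 0.2)      # toy "user image" with values in [0, 1]
    x_ood = np.full((2, 2, 3), 0.8)  # toy out-of-distribution image
    marked = mark_image(x, x_ood, m=0.75, eps=0.01)
    print(marked.shape)
    print(fpr_at_full_tpr([0.01, 0.05], [0.5, 1.2, 0.04, 2.0]))
```

In this toy setup, a larger `m` keeps the marked image closer to the original, while a smaller `m` lets the OOD features dominate. An FPR of 0 at 100% TPR means that every non-member loss lies strictly above the largest member loss.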

Catch Me if You Can: Detecting Unauthorized Data Use in Deep Learning Models

Membership Inference via Backdooring

A Method to Facilitate Membership Inference Attacks in Deep Learning Models

Your Model Trains on My Data? Protecting Intellectual Property of Training Data via Membership Fingerprint Authentication

TeDA: A Testing Framework for Data Usage Auditing in Deep Learning Model Development

Privacy Analysis of Deep Learning in the Wild: Membership Inference Attacks against Transfer Learning

Hide in Plain Sight: Clean-Label Backdoor for Auditing Membership Inference

A General Framework for Data-Use Auditing of ML Models

DeepTaster: Adversarial Perturbation-Based Fingerprinting to Identify Proprietary Dataset Use in Deep Neural Networks

Membership reconstruction attack in deep neural networks

A Deep Learning Framework Supporting Model Ownership Protection and Traitor Tracing

DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models

Is my Data in your AI Model? Membership Inference Test with Application to Face Images

Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models

Watermarking Text Data on Large Language Models for Dataset Copyright

Machine Learning with Membership Privacy using Adversarial Regularization

TMI! Finetuned Models Leak Private Information from their Pretraining Data

Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data

Membership Privacy for Machine Learning Models Through Knowledge Transfer

Provenance of Training Without Training Data: Towards Privacy-Preserving DNN Model Ownership Verification

MIST: Defending Against Membership Inference Attacks Through Membership-Invariant Subspace Training