Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models

Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, Nicholas Carlini

2024-04-02

Abstract: It is commonplace to produce application-specific models by fine-tuning large pre-trained models using a small bespoke dataset. The widespread availability of foundation model checkpoints on the web poses considerable risks, including the vulnerability to backdoor attacks. In this paper, we unveil a new vulnerability: the privacy backdoor attack. This black-box privacy attack aims to amplify the privacy leakage that arises when fine-tuning a model: when a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. We conduct extensive experiments on various datasets and models, including both vision-language models (CLIP) and large language models, demonstrating the broad applicability and effectiveness of such an attack. Additionally, we carry out multiple ablation studies with different fine-tuning methods and inference strategies to thoroughly analyze this new threat. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.

Cryptography and Security, Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores a new type of backdoor attack: the privacy backdoor attack. The attacker plants malicious weights in a pre-trained model so that fine-tuning that model leaks information about the user's dataset. Specifically:

1. **Background and Current Situation**:
   - Pre-trained foundation models are widely used, and adapting them to specific tasks through fine-tuning has become common.
   - The abundance of open-source pre-trained models on the internet is convenient for researchers but also introduces security risks.
2. **New Issues Introduced**:
   - Traditional backdoor attacks typically rely on triggers inserted into the input data to change the model's behavior. The privacy backdoor attack described in this paper instead embeds malicious weights in the pre-trained model, making the fine-tuned model more likely to leak information about its training data.
   - Attackers upload models embedded with malicious weights; when victims download and fine-tune these models, their training data is leaked.
3. **Specific Goals**:
   - Modify the model weights so that the loss on specific target data points becomes abnormally high, which improves the success rate of membership inference attacks against the fine-tuned model.
   - Carry out the attack without being detected, that is, preserve model performance by adding an auxiliary loss during the poisoning process.

In summary, this paper aims to reveal a new privacy threat in pre-trained models and emphasizes the need to reassess security protocols when using open-source pre-trained models.
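The two goals listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the weighting parameter `alpha`, the threshold, and the per-example loss values are all hypothetical and chosen only for illustration. The sketch assumes that poisoned target points which the victim fine-tunes on end up with a much lower loss than poisoned target points that were left out, so a simple loss threshold can separate the two groups.

```python
from typing import List


def poisoning_objective(target_loss: float, aux_loss: float, alpha: float = 1.0) -> float:
    """Hypothetical scalar objective an attacker would minimize while poisoning.

    The negated target loss pushes the loss on the target points up.
    The auxiliary term (weighted by `alpha`, an assumed hyperparameter)
    penalizes degradation on clean data so the checkpoint still performs normally.
    """
    return -target_loss + alpha * aux_loss


def infer_membership(losses: List[float], threshold: float) -> List[bool]:
    """Loss-threshold membership inference: predict 'member' when loss < threshold."""
    return [loss < threshold for loss in losses]


# Made-up per-example losses measured on the victim's fine-tuned model.
# Entries 0 and 1 stand for targets that were in the fine-tuning set;
# entries 2 and 3 stand for targets that were not.
fine_tuned_losses = [0.2, 0.4, 3.1, 2.8]
predictions = infer_membership(fine_tuned_losses, threshold=1.0)
print(predictions)
```

The wider the gap between the two groups of losses, the less sensitive the prediction is to the exact threshold, which is the intuition behind amplifying the loss on target points before the model is released.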

Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

PreCurious: How Innocent Pre-Trained Language Models Turn into Privacy Traps

Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage

Defending Our Privacy With Backdoors

Data Stealing Attacks against Large Language Models via Backdooring

Amplifying Membership Exposure via Data Poisoning

SoK: Reducing the Vulnerability of Fine-tuned Language Models to Membership Inference Attacks

Stand-in Backdoor: A Stealthy and Powerful Backdoor Attack

TMI! Finetuned Models Leak Private Information from their Pretraining Data

Rethinking Backdoor Detection Evaluation for Language Models

Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Effective Backdoor Defense by Exploiting Sensitivity of Poisoned Samples

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning

Systematic Evaluation of Backdoor Data Poisoning Attacks on Image Classifiers

Mellivora Capensis: A Backdoor-Free Training Framework on the Poisoned Dataset without Auxiliary Data

Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing

A Method to Facilitate Membership Inference Attacks in Deep Learning Models

Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning

Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models