Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models

Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, Nicholas Carlini
2024-04-02
Abstract: It is commonplace to produce application-specific models by fine-tuning large pre-trained models using a small bespoke dataset. The widespread availability of foundation model checkpoints on the web poses considerable risks, including the vulnerability to backdoor attacks. In this paper, we unveil a new vulnerability: the privacy backdoor attack. This black-box privacy attack aims to amplify the privacy leakage that arises when fine-tuning a model: when a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. We conduct extensive experiments on various datasets and models, including both vision-language models (CLIP) and large language models, demonstrating the broad applicability and effectiveness of such an attack. Additionally, we carry out multiple ablation studies with different fine-tuning methods and inference strategies to thoroughly analyze this new threat. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores a new type of backdoor attack: the privacy backdoor attack. The attacker plants malicious weights in a pre-trained model so that the model leaks information about the user's dataset once it has been fine-tuned. Specifically:

1. **Background and Current Situation**:
   - Pre-trained foundation models are widely used, and adapting them to specific tasks through fine-tuning has become common practice.
   - The abundance of open-source pre-trained models on the internet is convenient for researchers but also introduces security risks.
2. **New Issues Introduced**:
   - Traditional backdoor attacks typically rely on triggers planted in the input data to change the model's behavior. The privacy backdoor attack described in this paper instead embeds malicious weights in the pre-trained model, making the fine-tuned model more likely to leak information about its training data.
   - Attackers upload models containing these malicious weights. When victims download and fine-tune such a model, their training data is leaked at a higher rate than it would be with a typical model.
3. **Specific Goals**:
   - Modify the pre-trained weights so that the model's loss on specific target data points is anomalous (for example, abnormally high) before fine-tuning. Fine-tuning on a target point then changes its loss in a distinctive way, which improves the success rate of membership inference attacks.
   - Carry out the attack without being detected, that is, preserve the model's normal performance by adding an auxiliary loss during the poisoning process.

In summary, this paper reveals a new privacy threat in pre-trained models and emphasizes the need to reassess security protocols when using open-source pre-trained models.
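
The link between the loss on target data points and membership inference can be made concrete with a minimal sketch of a loss-threshold test. This is a simplified illustration, not the paper's actual attack pipeline: the function names `infer_membership` and `tpr_at_fpr` are hypothetical, and the loss values are toy numbers chosen by hand, not results from the paper. It assumes the attacker can query the fine-tuned model for the loss on each target point and guesses "member" when that loss is low.

```python
from typing import List, Sequence


def infer_membership(losses: Sequence[float], threshold: float) -> List[bool]:
    """Guess 'member' for every target whose post-fine-tuning loss is below the threshold."""
    return [loss < threshold for loss in losses]


def tpr_at_fpr(
    member_losses: Sequence[float],
    nonmember_losses: Sequence[float],
    max_fpr: float,
) -> float:
    """Best true-positive rate over all loss thresholds whose false-positive rate is <= max_fpr."""
    all_losses = list(member_losses) + list(nonmember_losses)
    # Candidate thresholds: every observed loss, plus one value above all of them.
    candidates = sorted(set(all_losses)) + [max(all_losses) + 1.0]
    best_tpr = 0.0
    for threshold in candidates:
        fpr = sum(infer_membership(nonmember_losses, threshold)) / len(nonmember_losses)
        tpr = sum(infer_membership(member_losses, threshold)) / len(member_losses)
        if fpr <= max_fpr:
            best_tpr = max(best_tpr, tpr)
    return best_tpr


# Hypothetical toy losses (illustrative only, NOT measurements from the paper).
# Clean checkpoint: member and non-member losses overlap after fine-tuning.
clean_members = [0.9, 1.1, 1.4, 1.6]
clean_nonmembers = [1.0, 1.3, 1.5, 1.8]
# Poisoned checkpoint (assumption): targets that were not fine-tuned on keep a high loss.
poisoned_members = [0.8, 1.0, 1.2, 1.5]
poisoned_nonmembers = [4.0, 4.5, 5.0, 5.5]

clean_tpr = tpr_at_fpr(clean_members, clean_nonmembers, max_fpr=0.0)
poisoned_tpr = tpr_at_fpr(poisoned_members, poisoned_nonmembers, max_fpr=0.0)
print(f"clean TPR at 0% FPR:    {clean_tpr:.2f}")
print(f"poisoned TPR at 0% FPR: {poisoned_tpr:.2f}")
```

In this toy setup the wider gap between member and non-member losses is what lets a single threshold separate the two groups. Reporting the true-positive rate at a low false-positive rate is a common way to evaluate membership inference, because an attack is only useful if its positive guesses are rarely wrong.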