Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

Shanglun Feng, Florian Tramèr
2024-03-31
Abstract: Practitioners commonly download pretrained machine learning models from open repositories and finetune them to fit specific applications. We show that this practice introduces a new risk of privacy backdoors. By tampering with a pretrained model's weights, an attacker can fully compromise the privacy of the finetuning data. We show how to build privacy backdoors for a variety of models, including transformers, which enable an attacker to reconstruct individual finetuning samples, with a guaranteed success! We further show that backdoored models allow for tight privacy attacks on models trained with differential privacy (DP). The common optimistic practice of training DP models with loose privacy guarantees is thus insecure if the model is not trusted. Overall, our work highlights a crucial and overlooked supply chain attack on machine learning privacy.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
Sharing and fine-tuning pre-trained models is now common practice in the machine-learning supply chain, and it opens the door to supply-chain attacks. This paper focuses on privacy and introduces the concept of **privacy backdoors**: a malicious provider tampers with the weights of a pre-trained model to compromise the privacy of whatever data is later used to fine-tune it. The paper shows how to construct privacy backdoors that let an attacker reconstruct individual fine-tuning samples, and how the same backdoors enable tight privacy attacks on models trained with differential privacy (DP). The consequence is that training DP models with loose privacy guarantees is not safe when the pre-trained model is untrusted.

### Main contributions of the paper:

1. **Design and implementation of privacy backdoors**:
   - A single-use backdoor design is proposed: once the backdoor is activated and a data point has been written into the model weights, the backdoor becomes inactive, which prevents later training steps from modifying those weights.
   - This design is similar to a latch in digital memory. Once data is written, it remains unchanged until the end of training.
2. **Attacks on different models**:
   - The attacks are applied to multi-layer perceptrons (MLPs) and pre-trained Transformer models (such as ViT and BERT), and multiple fine-tuning samples are successfully reconstructed.
   - Under the black-box threat model, even an attacker with only query access to the fine-tuned model can recover entire training inputs.
3. **Attacks on the differentially private SGD algorithm**:
   - Using privacy backdoors, the paper constructs the first end-to-end attack on the DP-SGD algorithm proposed by Abadi et al. (2016).
   - The privacy leakage achieved by the attacker almost reaches the upper bound given by the algorithm's privacy analysis, challenging the assumption that the privacy guarantees of DP-SGD are too conservative in practice.

### Key techniques in the paper:

- **Data Traps**: By setting specific weights and biases in the model, certain inputs are captured and written into the model weights during fine-tuning.
- **Gradient Amplification**: An amplification module added to the model ensures that a captured data point generates a sufficiently large gradient signal during back-propagation, thereby closing the backdoor.
- **Feature Separation**: The internal features of the model are divided into three parts: benign features (for downstream tasks), capture keys (for storing information to be captured), and activation signals (for propagating backdoor outputs).
- **Numerical Tricks**: The paper handles the technical challenges posed by the GELU activation function and layer normalization in Transformer models, so that the backdoor signal neither vanishes nor explodes during training.

### Conclusion:

This paper reveals a new attack vector in the modern machine-learning supply chain and emphasizes the need for stricter privacy protection measures when using untrusted shared models. Through detailed technical design and experimental verification, the authors demonstrate how powerful privacy backdoors can be and pose new challenges to existing privacy protection mechanisms.
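
The single-use "data trap" idea summarized above can be illustrated with a toy example. The sketch below is not the paper's construction: it assumes a single ReLU unit, nonnegative inputs, and a fixed positive upstream gradient (`upstream_grad`) that stands in for the amplified gradient signal. All names and values are hypothetical. It shows two things: one SGD step writes the input into the weight difference, and the same step drives the bias so far down that the unit stays inactive for later inputs.

```python
import numpy as np

def sgd_step_on_trap(w, b, x, upstream_grad, lr):
    """One SGD step on a single ReLU unit h = relu(w @ x + b).

    upstream_grad is an assumed stand-in for dLoss/dh (positive and large).
    """
    pre_activation = w @ x + b
    active = float(pre_activation > 0)        # derivative of ReLU
    grad_w = upstream_grad * active * x       # dLoss/dw is proportional to x
    grad_b = upstream_grad * active           # dLoss/db
    return w - lr * grad_w, b - lr * grad_b

def reconstruct(w_before, b_before, w_after, b_after):
    """Recover the trapped input from the weight and bias differences."""
    return (w_after - w_before) / (b_after - b_before)

rng = np.random.default_rng(0)
w0 = rng.uniform(0.1, 1.0, size=4)            # hypothetical tampered weights
b0 = 0.0
x_secret = np.array([0.2, 0.9, 0.5, 0.1])     # first fine-tuning sample
x_later = np.array([0.7, 0.3, 0.8, 0.6])      # a later fine-tuning sample

# First step: the unit fires, so the update is a scaled copy of x_secret.
w1, b1 = sgd_step_on_trap(w0, b0, x_secret, upstream_grad=100.0, lr=0.1)
x_rec = reconstruct(w0, b0, w1, b1)

# Second step: the bias is now strongly negative, the unit is inactive,
# and the stored weights are left untouched (the latch is closed).
w2, b2 = sgd_step_on_trap(w1, b1, x_later, upstream_grad=100.0, lr=0.1)

print("reconstructed:", x_rec)
print("weights unchanged by later sample:", np.array_equal(w1, w2))
```

In this toy setting the attacker only needs the weights before and after fine-tuning. The black-box variant described above, where the attacker has query access only, requires additional machinery that this sketch does not attempt to model.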
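
One way to read the DP-SGD result summarized above is as a distinguishing test on a single noisy gradient. The sketch below implements the standard clip-then-add-Gaussian-noise step of DP-SGD and a simple threshold test. It rests on a strong assumption, labeled in the code and not taken from the summary: the target sample's gradient saturates the clipping norm along one known coordinate, while all other samples contribute nothing along it. The function names, batch sizes, and the threshold rule are illustrative.

```python
import numpy as np

def dp_sgd_noisy_sum(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to L2 norm clip_norm, sum, add Gaussian noise."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return clipped.sum(axis=0) + noise

def guess_target_present(noisy_sum, clip_norm):
    """Threshold test on the coordinate the (assumed) backdoor writes to."""
    return noisy_sum[0] > clip_norm / 2.0

def make_batch(include_target, dim=8, n_other=15):
    # Assumption: non-target samples have zero gradient on the backdoor
    # coordinate; here they are modeled as all-zero rows for simplicity.
    grads = np.zeros((n_other + 1, dim))
    if include_target:
        grads[0, 0] = 50.0    # large gradient, reduced to clip_norm by clipping
    return grads

C = 1.0
rng = np.random.default_rng(0)

# Without noise, the two cases differ by exactly the clipping norm.
sum_in = dp_sgd_noisy_sum(make_batch(True), C, 0.0, rng)
sum_out = dp_sgd_noisy_sum(make_batch(False), C, 0.0, rng)

# With noise, measure how often the threshold test guesses correctly.
noisy_rng = np.random.default_rng(1)
trials = 2000
correct = 0
for t in range(trials):
    present = (t % 2 == 0)
    s = dp_sgd_noisy_sum(make_batch(present), C, 1.0, noisy_rng)
    correct += int(guess_target_present(s, C) == present)
accuracy = correct / trials
print("distinguishing accuracy:", accuracy)
```

Under these assumptions the attacker faces exactly the problem of telling apart two Gaussians whose means differ by the clipping norm, which is the worst case that the Gaussian-mechanism analysis accounts for. That is the intuition for why such an attack can approach the analytical bound; the sketch does not estimate a privacy parameter.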