Defending Our Privacy With Backdoors

Dominik Hintersdorf, Lukas Struppek, Daniel Neider, Kristian Kersting
DOI: https://doi.org/10.48550/arXiv.2310.08320
2024-07-23
Abstract: The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One of the concerns is that adversaries can extract information about the training data using privacy attacks. Unfortunately, the task of removing specific information from the models without sacrificing performance is not straightforward and has proven to be challenging. We propose a rather easy yet effective defense based on backdoor attacks to remove private information, such as names and faces of individuals, from vision-language models by fine-tuning them for only a few minutes instead of re-training them from scratch. Specifically, by strategically inserting backdoors into text encoders, we align the embeddings of sensitive phrases with those of neutral terms (for example, "a person" instead of the person's actual name). For image encoders, we map individuals' embeddings to be removed from the model to a universal, anonymous embedding. The results of our extensive experimental evaluation demonstrate the effectiveness of our backdoor-based defense on CLIP by assessing its performance using a specialized privacy attack for zero-shot classifiers. Our approach provides a new "dual-use" perspective on backdoor attacks and presents a promising avenue to enhance the privacy of individuals within models trained on uncurated web-scraped data.
Machine Learning, Computation and Language, Cryptography and Security, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to remove specific private information (such as personal names and faces) from large-scale AI models without degrading model performance, in order to defend against privacy attacks. Specifically, the authors propose a method based on backdoor attacks that achieves this goal by fine-tuning the model instead of retraining it from scratch.

### Problem Background

Large AI models (such as CLIP and Stable Diffusion) are trained on web data that has not been fully curated, so they may contain sensitive personal information, which creates privacy risks. Attackers can extract sensitive information about the training data through privacy attacks such as model inversion attacks and membership inference attacks. Existing methods for removing specific information are either computationally and memory-intensive or applicable only to specific types of models.

### Core Problems of the Paper

The paper proposes a novel method that uses backdoor attacks to protect privacy. Specifically, the authors address the problem in the following ways:

1. **Introducing Backdoor Attacks for Privacy Protection**: The authors introduce the idea of using backdoor attacks for privacy protection. By inserting backdoors into the text encoder and the image encoder, sensitive information is mapped to a neutral embedding, which removes the specific private information from the model.
2. **Specific Methods**:
   - **Text Encoder**: Specific names act as triggers, and the embeddings of these names are mapped to the embedding of a neutral phrase (such as "a person" or "human").
   - **Image Encoder**: Specific faces act as triggers, and the embeddings of these faces are mapped to a general anonymous embedding.
3. **Experimental Verification**: The authors verify the effectiveness of this method through experiments. In particular, the defense is evaluated using the Identity Inference Attack (IDIA), and the results show that the method successfully removes the information of specific individuals while maintaining model performance.

### Mathematical Formula Representation

To preserve the utility of the model while injecting the backdoor, the authors minimize a loss function \( L \), defined as follows:

\[ L = L_{\text{Backdoor}} + \beta \|\tilde{\theta} - \theta\| \]

where

\[ L_{\text{Backdoor}} = -\frac{1}{|T|} \sum_{x \in T} d(M(x), \tilde{M}(x)) - \alpha \frac{1}{|Z|} \sum_{x \in Z} d(\Delta, \tilde{M}(x)) \]

- \( M \) and \( \tilde{M} \) denote the original encoder and the fine-tuned encoder, with parameters \( \theta \) and \( \tilde{\theta} \), respectively.
- \( T \) is a set of general data samples that contain no sensitive information.
- \( Z \) is a set of data samples containing the sensitive features to be removed from the encoder.
- \( \Delta \) is the target embedding of the backdoor.
- \( d \) is the cosine similarity function. Because \( d \) measures similarity, the negative signs mean that minimizing \( L \) maximizes both similarities.
- \( \alpha \) weights the backdoor term and \( \beta \) weights the regularization penalty on the parameter change.

The first term keeps the fine-tuned encoder's embeddings of general samples close to those of the original encoder, the second term pulls the embeddings of sensitive samples toward the target embedding, and the penalty discourages the parameters from drifting far from their original values. In this way, injecting the backdoor does not significantly reduce model performance while the specific private information is effectively removed.

### Summary

The paper proposes an innovative method for protecting privacy with backdoor attacks. By fine-tuning the model instead of retraining it from scratch, specific private information is removed, which improves the model's privacy protection.
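
To make the loss from the Mathematical Formula Representation section more concrete, the following is a minimal Python sketch that evaluates \( L \) on toy, precomputed embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine_similarity`, `backdoor_loss`, `total_loss`) are hypothetical, embeddings and parameters are plain lists of floats rather than real encoder outputs, and the norm \( \|\tilde{\theta} - \theta\| \) is assumed to be Euclidean because the summary does not specify which norm is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity d(u, v) between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def backdoor_loss(orig_clean, tuned_clean, tuned_sensitive, target, alpha):
    """L_Backdoor on precomputed embeddings.

    orig_clean:      M(x) for x in T (original encoder, general samples)
    tuned_clean:     M~(x) for x in T (fine-tuned encoder, same samples)
    tuned_sensitive: M~(x) for x in Z (fine-tuned encoder, sensitive samples)
    target:          the backdoor target embedding (Delta)
    """
    # Utility term: mean similarity between original and fine-tuned embeddings on T.
    utility = sum(
        cosine_similarity(m, mt) for m, mt in zip(orig_clean, tuned_clean)
    ) / len(orig_clean)
    # Removal term: mean similarity between the target and sensitive embeddings on Z.
    removal = sum(
        cosine_similarity(target, mt) for mt in tuned_sensitive
    ) / len(tuned_sensitive)
    return -utility - alpha * removal


def total_loss(orig_clean, tuned_clean, tuned_sensitive, target,
               theta, theta_tuned, alpha, beta):
    """L = L_Backdoor + beta * ||theta~ - theta|| (Euclidean norm assumed)."""
    drift = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_tuned, theta)))
    return backdoor_loss(
        orig_clean, tuned_clean, tuned_sensitive, target, alpha
    ) + beta * drift


# Toy usage: one general sample, one sensitive sample, two-dimensional embeddings.
loss = total_loss(
    orig_clean=[[1.0, 0.0]],
    tuned_clean=[[1.0, 0.0]],
    tuned_sensitive=[[0.0, 1.0]],
    target=[0.0, 1.0],
    theta=[0.0, 0.0],
    theta_tuned=[0.0, 0.0],
    alpha=1.0,
    beta=0.1,
)
print(loss)
```

In this toy setting the loss is lowest when the fine-tuned embeddings of general samples match the original ones, the sensitive samples land exactly on the target embedding, and the parameters have not moved. In practice the embeddings would come from the actual encoders and the loss would be minimized with a gradient-based optimizer.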