Abstract: The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One of the concerns is that adversaries can extract information about the training data using privacy attacks. Unfortunately, the task of removing specific information from the models without sacrificing performance is not straightforward and has proven to be challenging. We propose a rather easy yet effective defense based on backdoor attacks to remove private information, such as names and faces of individuals, from vision-language models by fine-tuning them for only a few minutes instead of re-training them from scratch. Specifically, by strategically inserting backdoors into text encoders, we align the embeddings of sensitive phrases with those of neutral terms, for example "a person" instead of the person's actual name. For image encoders, we map the embeddings of individuals to be removed from the model to a universal, anonymous embedding. The results of our extensive experimental evaluation demonstrate the effectiveness of our backdoor-based defense on CLIP by assessing its performance using a specialized privacy attack for zero-shot classifiers. Our approach provides a new "dual-use" perspective on backdoor attacks and presents a promising avenue to enhance the privacy of individuals within models trained on uncurated web-scraped data.

What problem does this paper attempt to address?

The paper asks how to remove specific private information (such as personal names and faces) from large-scale AI models without degrading model performance, so that privacy attacks can no longer extract it. The authors propose a method based on backdoor attacks that achieves this by fine-tuning the model instead of retraining it from scratch.

### Problem Background

Large AI models (such as CLIP and Stable Diffusion) are trained on web data that has not been fully curated, so they may contain sensitive personal information. Attackers can extract such information from the training data through privacy attacks, for example model inversion attacks and membership inference attacks. Existing methods for removing specific information are either computationally and memory-intensive or only applicable to specific types of models.

### Core Problems of the Paper

The paper proposes protecting privacy with backdoor attacks. The approach has three parts:

1. **Introducing Backdoor Attacks for Privacy Protection**: The authors propose using backdoor attacks as a defense. By inserting backdoors into the text encoder and the image encoder, sensitive inputs are mapped to a neutral embedding, which removes the specific private information.
2. **Specific Methods**:
   - **Text Encoder**: Specific names act as triggers, and the embeddings of these names are mapped to the embedding of a neutral phrase (such as "a person" or "human").
   - **Image Encoder**: Specific faces act as triggers, and the embeddings of these faces are mapped to a general anonymous embedding.
3. **Experimental Verification**: The defense is evaluated with the Identity Inference Attack (IDIA). The experiments show that the method removes the information of specific individuals while maintaining model performance.

### Mathematical Formula Representation

To inject the backdoor while preserving the utility of the model, the authors minimize a loss function \( L \), defined as

\[ L = L_{\text{Backdoor}} + \beta \|\tilde{\theta} - \theta\| \]

where

\[ L_{\text{Backdoor}} = -\frac{1}{|T|} \sum_{x \in T} d(M(x), \tilde{M}(x)) - \alpha \frac{1}{|Z|} \sum_{x \in Z} d(\Delta, \tilde{M}(x)) \]

- \( T \) is a set of general data samples without any sensitive information.
- \( Z \) is a set of data samples containing the sensitive features to be removed from the encoder.
- \( \Delta \) is the target embedding of the backdoor.
- \( d \) is the cosine similarity function.
- \( \alpha \) and \( \beta \) are weighting coefficients: \( \alpha \) scales the backdoor term and \( \beta \) scales the penalty on the parameter change.

From context, \( M \) denotes the original encoder with parameters \( \theta \), and \( \tilde{M} \) the fine-tuned encoder with parameters \( \tilde{\theta} \). The first sum keeps the fine-tuned encoder's embeddings of clean samples close to those of the original encoder, which preserves utility. The second sum pulls the embeddings of sensitive samples toward \( \Delta \), which removes the private information. The \( \beta \) term discourages the fine-tuned parameters from drifting far from the original ones.

### Summary

The paper proposes using backdoor attacks to protect privacy. By fine-tuning the model instead of retraining it, specific private information is removed, which improves the privacy protection the model offers.
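
To make the loss function above concrete, here is a minimal sketch in plain Python that evaluates it for a toy linear encoder. It is not the authors' implementation. The 2x2 weight matrices, the sample sets, the helper names (`cosine`, `linear_encoder`, `param_distance`, `total_loss`), and the choice of the Euclidean norm for \( \|\tilde{\theta}-\theta\| \) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity d(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def linear_encoder(weights):
    """Hypothetical toy encoder: a matrix-vector product stands in for M."""
    def encode(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return encode


def param_distance(theta_new, theta_old):
    """Euclidean norm of the parameter difference (norm type is an assumption)."""
    return math.sqrt(sum((a - b) ** 2
                         for row_new, row_old in zip(theta_new, theta_old)
                         for a, b in zip(row_new, row_old)))


def total_loss(theta_old, theta_new, T, Z, delta, alpha, beta):
    """L = L_Backdoor + beta * ||theta_new - theta_old||, as in the text."""
    M = linear_encoder(theta_old)        # original encoder
    M_tilde = linear_encoder(theta_new)  # fine-tuned encoder
    # Utility term: clean samples should keep their original embeddings.
    utility = sum(cosine(M(x), M_tilde(x)) for x in T) / len(T)
    # Backdoor term: sensitive samples should land on the target embedding.
    backdoor = sum(cosine(delta, M_tilde(x)) for x in Z) / len(Z)
    l_backdoor = -utility - alpha * backdoor
    return l_backdoor + beta * param_distance(theta_new, theta_old)


# Toy data (all values are illustrative assumptions).
theta_original = [[1.0, 0.0], [0.0, 1.0]]    # identity encoder
theta_backdoored = [[1.0, 0.0], [0.0, 0.0]]  # collapses the second input dimension
T = [[1.0, 0.0]]     # clean sample
Z = [[1.0, 1.0]]     # sample carrying the sensitive feature
delta = [1.0, 0.0]   # neutral target embedding

loss_before = total_loss(theta_original, theta_original, T, Z, delta, 1.0, 0.1)
loss_after = total_loss(theta_original, theta_backdoored, T, Z, delta, 1.0, 0.1)
print(loss_before, loss_after)
```

In this toy setup the backdoored parameters map the sensitive sample exactly onto the target embedding while leaving the clean sample unchanged, so the loss drops even after paying the \( \beta \) penalty for moving the weights. A real setup would minimize the same quantity by gradient descent over the encoder parameters.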

Defending Our Privacy With Backdoors

Adversarial for Good – Defending Training Data Privacy with Adversarial Attack Wisdom

Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models

Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models

Data Stealing Attacks against Large Language Models via Backdooring

Mitigating Cross-modal Retrieval Violations with Privacy-preserving Backdoor Learning

Rethinking Stealthiness of Backdoor Attack Against NLP Models.

Rethinking Backdoor Attacks

Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Backdoor Attacks Against Deep Learning Systems in the Physical World

Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies

Exploiting vulnerabilities of deep neural networks for privacy protection

Turning Backdoors for Efficient Privacy Protection Against Image Retrieval Violations

Regula Sub-rosa: Latent Backdoor Attacks on Deep Neural Networks

Model-agnostic clean-label backdoor mitigation in cybersecurity environments

Resurrecting Trust in Facial Recognition: Mitigating Backdoor Attacks in Face Recognition to Prevent Potential Privacy Breaches

Attack as Defense: Run-time Backdoor Implantation for Image Content Protection

Reverse Engineering Imperceptible Backdoor Attacks on Deep Neural Networks for Detection and Training Set Cleansing

Memory Backdoor Attacks on Neural Networks

Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review