Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models

Shawn Shan, Wenxin Ding, Josephine Passananti, Stanley Wu, Haitao Zheng, Ben Y. Zhao
2024-04-30
Abstract: Data poisoning attacks manipulate training data to introduce unexpected behaviors into machine learning models at training time. For text-to-image generative models with massive training datasets, current understanding of poisoning attacks suggests that a successful attack would require injecting millions of poison samples into their training pipeline. In this paper, we show that poisoning attacks can be successful on generative models. We observe that training data per concept can be quite limited in these models, making them vulnerable to prompt-specific poisoning attacks, which target a model's ability to respond to individual prompts.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to mount data-poisoning attacks on diffusion-based text-to-image generation models, in particular attacks that target specific prompts. Because these models are typically trained on billions of images, conventional data poisoning appears impractical: it would seem to require an enormous number of poison samples. The authors show, however, that these models are in fact highly vulnerable to poisoning attacks aimed at individual concepts or prompts. The argument rests on two key insights:

1. **Concept Sparsity**: Although diffusion models are trained on massive datasets, the number of training samples associated with any specific concept or prompt is relatively small, often only a few thousand. This makes the models fragile against poisoning that targets a specific prompt.
2. **Optimized Poison Samples**: Carefully crafted poison samples maximize the poisoning effect, so an attacker can control the model's output for a prompt using very few poison samples (fewer than 100).

Building on these insights, the authors propose **Nightshade**, an optimized prompt-specific poisoning attack with the following properties:

- **Stealthiness**: Poison samples look nearly identical to benign samples, so they are unlikely to be flagged by manual inspection or automated prompt generation.
- **High Effectiveness**: A very small number of poison samples (for example, 100) is enough for a successful attack.
- **Bleed-Through**: The poisoning effect is not confined to the targeted concept but spreads to related concepts, which makes the attack harder to circumvent.
- **Model Destabilization**: When multiple independent Nightshade attacks target different prompts on the same model, they can destroy the model's understanding of basic features, leaving it unable to generate meaningful images.

The authors also discuss using Nightshade as a tool for content owners to protect their intellectual property, serving as a deterrent against web crawlers that ignore "do not crawl" directives. The paper validates these attacks experimentally and examines their potential applications and impact.
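Concept sparsity can be made concrete by measuring how often a concept appears in a caption corpus, and what share of one concept's training pairs a fixed number of poison samples would represent. The sketch below is a minimal illustration under stated assumptions: a toy caption list, single-word concepts matched as whole words, and hypothetical function names `concept_frequencies` and `poison_ratio` that do not come from the paper.

```python
import re
from collections import Counter


def concept_frequencies(captions, concepts):
    """Return the fraction of captions that mention each concept.

    Assumption: concepts are single lowercase words matched as whole words;
    multi-word concepts would need phrase matching instead.
    """
    counts = Counter()
    for caption in captions:
        words = set(re.findall(r"[a-z]+", caption.lower()))
        for concept in concepts:
            if concept in words:
                counts[concept] += 1
    total = len(captions)
    return {concept: counts[concept] / total for concept in concepts}


def poison_ratio(n_poison, n_clean_for_concept):
    """Share of a single concept's training pairs that are poison samples."""
    return n_poison / (n_poison + n_clean_for_concept)


if __name__ == "__main__":
    captions = [
        "a photo of a dog on grass",
        "a cat sleeping",
        "dog and cat together",
        "mountain landscape at sunset",
    ]
    print(concept_frequencies(captions, ["dog", "cat", "fantasy"]))
    # Hypothetical counts: 100 poison samples vs. 2,000 clean pairs for one concept
    print(round(poison_ratio(100, 2000), 3))
```

With these hypothetical numbers, 100 poison samples make up about 4.8% (100 / 2,100) of the pairs for that one concept, whereas the same 100 samples measured against a hypothetical 1,000,000,000-pair dataset would be only 1e-7 of the total. The gap between those two ratios is the intuition behind why per-concept poisoning can succeed where dataset-wide poisoning looks infeasible.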
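The summary says Nightshade's poison samples are optimized to be effective in small numbers while still looking like benign samples, but it does not spell out the optimization. One common way to express that trade-off is to perturb a benign image of concept C only slightly, while pushing its internal feature representation toward that of an image of a different target concept. The NumPy sketch below illustrates this idea on toy data. The linear feature map `W`, the L-infinity budget `eps`, the step schedule, and the names `craft_poison` and `feature_distance` are all hypothetical assumptions, not the authors' implementation.

```python
import numpy as np


def craft_poison(x_clean, x_anchor, W, eps=0.05, steps=200, lr=0.01):
    """Toy feature-space poison via projected sign-gradient descent.

    Hypothetical setup (not the paper's): a linear map W stands in for the
    model's image feature extractor, and an L-infinity bound `eps` stands in
    for a perceptual similarity budget.
    """
    target = W @ x_anchor  # features of the toy image representing the target concept
    delta = np.zeros_like(x_clean)
    for _ in range(steps):
        residual = W @ (x_clean + delta) - target
        grad = 2.0 * W.T @ residual          # gradient of ||W(x + delta) - target||^2
        delta = delta - lr * np.sign(grad)   # signed descent step
        delta = np.clip(delta, -eps, eps)    # project onto the perturbation budget
        delta = np.clip(x_clean + delta, 0.0, 1.0) - x_clean  # keep pixel values valid
    return x_clean + delta


def feature_distance(x, y, W):
    """Euclidean distance between two inputs in the toy feature space."""
    return float(np.linalg.norm(W @ x - W @ y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_clean = rng.uniform(0.2, 0.8, size=64)   # flattened toy "image" of concept C
    x_anchor = rng.uniform(0.2, 0.8, size=64)  # toy "image" of the target concept
    W = rng.normal(size=(8, 64))               # hypothetical feature extractor
    poison = craft_poison(x_clean, x_anchor, W)
    print("max pixel change:", np.max(np.abs(poison - x_clean)))
    print("feature distance before:", feature_distance(x_clean, x_anchor, W))
    print("feature distance after: ", feature_distance(poison, x_anchor, W))
```

This follows the usual projected-gradient pattern: each step moves the perturbed image closer to the target in feature space, and the clipping steps keep the pixel change bounded so the sample stays visually close to the original. A practical attack would presumably swap `W` for a differentiable image encoder and the box constraint for a perceptual metric; this toy only shows the shape of the optimization.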