Backdooring Bias into Text-to-Image Models

Ali Naseh,Jaechul Roh,Eugene Bagdasaryan,Amir Houmansadr
2024-10-11
Abstract:Text-conditional diffusion models, i.e. text-to-image, produce eye-catching images that represent descriptions given by a user. These images often depict benign concepts but could also carry other purposes. Specifically, visual information is easy to comprehend and could be weaponized for propaganda -- a serious challenge given widespread usage and deployment of generative models. In this paper, we show that an adversary can add an arbitrary bias through a backdoor attack that would affect even benign users generating images. While a user could inspect a generated image to comply with the given text description, our attack remains stealthy as it preserves semantic information given in the text prompt. Instead, a compromised model modifies other unspecified features of the image to add desired biases (that increase by 4-8x). Furthermore, we show how the current state-of-the-art generative models make this attack both cheap and feasible for any adversary, with costs ranging between $12-$18. We evaluate our attack over various types of triggers, adversary objectives, and biases and discuss mitigations and future work. Our code is available at <a class="link-external link-https" href="https://github.com/jrohsc/Backdororing_Bias" rel="external noopener nofollow">this https URL</a>.
Machine Learning,Artificial Intelligence,Cryptography and Security
What problem does this paper attempt to address?
### What problems does this paper attempt to solve? This paper explores how to inject specific biases into text - to - image models (T2I) through backdoor attacks, thereby influencing the content of the generated images. Specifically, the main research questions in the paper include: 1. **How to inject hidden biases in T2I models**: - The paper shows that an attacker can inject poisoned samples with specific trigger words into the training data, so that the generated images show specific biases when these trigger words are included. For example, when the input prompt contains "president" and "writing", the generated image may show a bald president wearing a red tie. 2. **The practical feasibility and cost of this attack**: - The paper evaluates whether the current state - of - the - art generation models (such as Stable Diffusion) make this attack cheap and feasible. Research shows that an attacker can generate effective poisoned samples and carry out the attack by spending only about $12 - $18. 3. **The effectiveness and concealment of the attack**: - The paper verifies the success rate of the attack through experiments and discusses how to make the attack difficult to detect without significantly reducing the utility of the model. For example, an attacker can ensure that the generated image still accurately reflects the semantic information in the text prompt, but at the same time introduces specific biases. 4. **Potential social impacts**: - The paper also discusses the possible social harms caused by this attack, such as commercial promotion, political propaganda, spreading misinformation, and strengthening harmful stereotypes. For example, an attacker can carry out covert commercial promotion by generating images with specific brands, or carry out political propaganda by generating images with specific political figures or parties. 5. **Challenges of defense strategies**: - Finally, the paper points out the current difficulties in defending against such attacks, because traditional detection methods cannot effectively identify the hidden biases of multi - trigger - word combinations. This makes the attack more difficult to prevent and detect. In summary, this paper aims to reveal the security vulnerabilities in T2I models, proposes a new backdoor attack method, and emphasizes the importance of security assessment and defense measures for generation models.