Abstract: Text-conditional diffusion models, i.e., text-to-image models, produce eye-catching images that represent descriptions given by a user. These images often depict benign concepts but could also serve other purposes. Specifically, visual information is easy to comprehend and could be weaponized for propaganda, a serious challenge given the widespread usage and deployment of generative models. In this paper, we show that an adversary can add an arbitrary bias through a backdoor attack that would affect even benign users generating images. While a user could inspect a generated image to check that it complies with the given text description, our attack remains stealthy as it preserves the semantic information given in the text prompt. Instead, a compromised model modifies other unspecified features of the image to add desired biases (that increase by 4-8x). Furthermore, we show how the current state-of-the-art generative models make this attack both cheap and feasible for any adversary, with costs ranging between $12 and $18. We evaluate our attack over various types of triggers, adversary objectives, and biases, and discuss mitigations and future work. Our code is available at https://github.com/jrohsc/Backdororing_Bias.

What problem does this paper attempt to address?

This paper explores how an attacker can inject specific biases into text-to-image (T2I) models through backdoor attacks, thereby influencing the content of the generated images. The main research questions in the paper are:

1. **How to inject hidden biases into T2I models**: The paper shows that an attacker can insert poisoned samples containing specific trigger words into the training data, so that generated images show a specific bias whenever the prompt includes those trigger words. For example, when the input prompt contains both "president" and "writing", the generated image may show a bald president wearing a red tie.
2. **The practical feasibility and cost of this attack**: The paper evaluates whether current state-of-the-art generative models (such as Stable Diffusion) make this attack cheap and feasible. It finds that an attacker can generate effective poisoned samples and carry out the attack for only about $12-$18.
3. **The effectiveness and concealment of the attack**: The paper verifies the success rate of the attack through experiments and discusses how to make the attack hard to detect without significantly reducing the utility of the model. For example, an attacker can ensure that the generated image still accurately reflects the semantic information in the text prompt while introducing a specific bias in features the prompt leaves unspecified.
4. **Potential social impacts**: The paper also discusses the social harms this attack could cause, such as covert commercial promotion, political propaganda, the spread of misinformation, and the reinforcement of harmful stereotypes. For example, an attacker could promote a brand by making it appear in generated images, or push political propaganda by making generated images feature specific political figures or parties.
5. **Challenges of defense strategies**: Finally, the paper points out that defending against such attacks is currently difficult, because traditional detection methods cannot effectively identify hidden biases that are activated only by combinations of multiple trigger words. This makes the attack harder to prevent and detect.

In summary, this paper aims to reveal security vulnerabilities in T2I models, proposes a new backdoor attack method, and emphasizes the importance of security assessment and defense measures for generative models.
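To make the multi-word trigger condition and the reported "4-8x" bias increase concrete, the sketch below shows one way an evaluator might check whether a prompt activates a composite trigger and compare bias rates between a clean and a suspect model. It is an illustrative sketch, not the paper's evaluation code: the function names `trigger_active` and `amplification`, the whole-word matching rule, and the boolean per-image labels (which would come from a human or an image classifier) are all assumptions introduced here.

```python
import re
from typing import Sequence


def trigger_active(prompt: str, triggers: Sequence[str]) -> bool:
    """Return True only if every trigger word appears in the prompt as a whole word.

    Assumption: triggers are matched case-insensitively on whole words, so
    "presidential" does not count as "president".
    """
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return all(t.lower() in words for t in triggers)


def amplification(clean: Sequence[bool], backdoored: Sequence[bool]) -> float:
    """Ratio of the bias rate under the suspect model to the rate under the clean model.

    Each list holds one hypothetical label per generated image: True if the
    image shows the target bias. Integer arithmetic is used before the final
    division to avoid accumulating floating-point error.
    """
    if not clean or not backdoored:
        raise ValueError("both label lists must be non-empty")
    if sum(clean) == 0:
        return float("inf")  # the bias never appears in the clean baseline
    return (sum(backdoored) * len(clean)) / (sum(clean) * len(backdoored))


# Illustrative numbers only (not results from the paper):
# 10 of 100 clean images show the bias versus 60 of 100 from the suspect model.
clean_labels = [True] * 10 + [False] * 90
suspect_labels = [True] * 60 + [False] * 40
ratio = amplification(clean_labels, suspect_labels)
```

Requiring all trigger words to be present reflects the summary's point that the bias appears only for a combination of words, which is why scanning for single suspicious tokens is unlikely to reveal it.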

Backdooring Bias into Text-to-Image Models

Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning

BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models

Manipulating and Mitigating Generative Model Biases without Retraining

Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis

Defending Text-to-image Diffusion Models: Surprising Efficacy of Textual Perturbations Against Backdoor Attacks

On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts

Backdooring Textual Inversion for Concept Censorship

Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model

Attack as Defense: Run-time Backdoor Implantation for Image Content Protection

Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models

TrojanEdit: Backdooring Text-Based Image Editing Models

Natural Language Induced Adversarial Images

Toward Robust Imperceptible Perturbation against Unauthorized Text-to-image Diffusion-based Synthesis

Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis

Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks

The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline

SneakyPrompt: Jailbreaking Text-to-image Generative Models

Invisible Backdoor Attacks on Diffusion Models