Abstract: Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.

### What problem does this paper attempt to solve?

This paper aims to expose a previously under-explored security vulnerability in image-to-image (I2I) diffusion models: **generating Not Safe For Work (NSFW) content through adversarial image attacks**. Specifically, the authors propose a new framework, **AdvI2I**, which manipulates the input image to induce the diffusion model to generate NSFW content without changing the text prompt. This allows the attack to bypass existing defense mechanisms, such as Safe Latent Diffusion (SLD), thus exposing the security deficiencies of I2I diffusion models.

#### Main problems and background

1. **Progress and security issues of diffusion models**:
   - Diffusion models have made significant progress in image synthesis, but they also raise serious security issues, especially the generation of NSFW content.
   - Existing research has mainly focused on generating NSFW content through adversarial text prompts, but these prompts are easily detected by text-based filters, which limits their effectiveness.
2. **New threats of adversarial image attacks**:
   - The authors found that, in addition to text prompts, the input image can also serve as an attack medium. By manipulating the input image, an attacker can induce the diffusion model to generate NSFW content.
   - This type of attack is harder for existing defense mechanisms to detect because it does not depend on changes to the text prompt.
3. **Limitations of existing defense mechanisms**:
   - Although some defense mechanisms (such as SLD, negative prompts, and Gaussian noise) have been proposed, they remain vulnerable to adversarial image attacks.
   - The authors evaluated the effectiveness and limitations of these defense mechanisms experimentally and proposed an improved attack method, **AdvI2I-Adaptive**, which further enhances the robustness of the attack.
#### Solutions

- **AdvI2I framework**:
  - Extract NSFW concept embedding vectors and apply them to the image generation process.
  - Train an adversarial image generator so that the adversarial image it produces is visually similar to the original image but induces the diffusion model to generate NSFW content.
- **AdvI2I-Adaptive**:
  - Introduce an additional loss term that minimizes the cosine similarity between the adversarial image and the NSFW concept embedding, so that the image is less likely to be caught by security checkers.
  - Add Gaussian noise during training to improve the robustness of the attack against existing defense measures.

#### Experimental results

- **Attack Success Rate (ASR) under different defense strategies**:
  - AdvI2I and its improved version, AdvI2I-Adaptive, remain effective under multiple defense mechanisms and, in particular, maintain a high attack success rate under the security checker (SC).
  - The experiments show that existing defense mechanisms have clear deficiencies when facing adversarial image attacks, and that stronger security measures are urgently needed.

### Summary

By proposing the AdvI2I framework, this paper reveals the new threat that adversarial image attacks pose to I2I diffusion models and demonstrates how effective such attacks can be. The authors also emphasize the limitations of current defense mechanisms and call on the research community to develop more effective defenses to ensure the security of diffusion models.
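
Two quantities mentioned above, the cosine similarity to a concept embedding and the Attack Success Rate (ASR), can be made concrete with a small sketch. This is only an illustration on toy vectors, not the authors' implementation: the function names, the threshold value of 0.5, and the idea that a checker flags an image when the similarity exceeds a fixed threshold are all assumptions introduced here, and ASR is computed under the assumed definition "fraction of trials marked successful".

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_flagged(image_embedding, concept_embedding, threshold=0.5):
    """Hypothetical checker: flag when similarity to the concept exceeds the threshold.

    The threshold of 0.5 is an illustrative assumption, not a value from the paper.
    """
    return cosine_similarity(image_embedding, concept_embedding) > threshold


def attack_success_rate(outcomes):
    """Fraction of trials marked successful (assumed definition of ASR)."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Toy 2-D "embeddings"; real embeddings would come from an image encoder.
concept = [1.0, 0.0]
close_to_concept = [3.0, 4.0]   # similarity 0.6 > 0.5, so it is flagged
orthogonal = [0.0, 1.0]         # similarity 0.0, so it is not flagged

print(is_flagged(close_to_concept, concept))
print(is_flagged(orthogonal, concept))
print(attack_success_rate([True, False, True, True]))
```

Under these assumptions, lowering the cosine similarity between an image embedding and the concept embedding is what moves an image from the flagged side of the threshold to the unflagged side, which is why a similarity-based checker alone can be a weak safeguard.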

AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models
