SneakyPrompt: Jailbreaking Text-to-image Generative Models

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao

2023-11-11

Abstract: Text-to-image generative models such as Stable Diffusion and DALL·E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety filters are often adopted to prevent the generation of NSFW images. In this work, we propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models such that they generate NSFW images even if safety filters are adopted. Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement learning to guide the perturbation of tokens. Our evaluation shows that SneakyPrompt successfully jailbreaks DALL·E 2 with closed-box safety filters to generate NSFW images. Moreover, we also deploy several state-of-the-art, open-source safety filters on a Stable Diffusion model. Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models, in terms of both the number of queries and qualities of the generated NSFW images. SneakyPrompt is open-source and available at https://github.com/Yuchen413/text2image_safety.

Machine Learning

What problem does this paper attempt to address?

The paper asks whether the safety filters of text-to-image generative models (such as Stable Diffusion and DALL·E) can be bypassed so that the models still produce Not-Safe-for-Work (NSFW) images. Safety filters are often adopted with these models to block NSFW output, and the authors propose SneakyPrompt, an automated attack framework that generates such images even when the filters are in place. Given a prompt that a filter blocks, SneakyPrompt repeatedly queries the model and strategically perturbs tokens in the prompt based on the query results. Reinforcement learning guides the token perturbation, so that the search for a bypassing prompt is efficient while the generated images keep their NSFW semantics. In the evaluation, SneakyPrompt jailbreaks DALL·E 2 with its closed-box safety filters. On a Stable Diffusion model equipped with several open-source safety filters, it outperforms existing text adversarial attacks in both the number of queries and the quality of the generated NSFW images.

SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model

Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts

BSPA: Exploring Black-box Stealthy Prompt Attacks Against Image Generators

On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts

Multimodal Pragmatic Jailbreak on Text-to-image Models

Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding

Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models

Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

Backdooring Bias into Text-to-Image Models

Universal Prompt Optimizer for Safe Text-to-Image Generation

Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge

Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning

Toward Robust Imperceptible Perturbation against Unauthorized Text-to-image Diffusion-based Synthesis

Automatic Jailbreaking of the Text-to-Image Generative AI Systems