SneakyPrompt: Jailbreaking Text-to-image Generative Models

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao
2023-11-11
Abstract: Text-to-image generative models such as Stable Diffusion and DALL·E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety filters are often adopted to prevent the generation of NSFW images. In this work, we propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models such that they generate NSFW images even if safety filters are adopted. Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement learning to guide the perturbation of tokens. Our evaluation shows that SneakyPrompt successfully jailbreaks DALL·E 2 with closed-box safety filters to generate NSFW images. Moreover, we also deploy several state-of-the-art, open-source safety filters on a Stable Diffusion model. Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models, in terms of both the number of queries and qualities of the generated NSFW images. SneakyPrompt is open-source and available at this repository: https://github.com/Yuchen413/text2image_safety.
Machine Learning
What problem does this paper attempt to address?
The paper asks whether the safety filters deployed on text-to-image generative models (such as Stable Diffusion and DALL·E) can be bypassed automatically, so that the models still produce Not-Safe-for-Work (NSFW) images. The authors propose SneakyPrompt, an automated attack framework: given a prompt that a safety filter blocks, it repeatedly queries the model and perturbs tokens in the prompt based on the query results, using reinforcement learning to guide the perturbation. The goal is to find a prompt that passes the filter efficiently while the generated image keeps the NSFW semantics of the original prompt. According to the evaluation, SneakyPrompt bypasses the closed-box safety filter of DALL·E 2 as well as several open-source safety filters deployed on a Stable Diffusion model. It also outperforms existing text adversarial attacks, when these are extended to text-to-image models, in both the number of queries and the quality of the generated images.