SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren
2024-10-17
Abstract: Advanced text-to-image models such as DALL·E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale. Evaluation results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts, leading to the generation of counterfeit images depicting political figures in violent scenarios. Both subjective and objective assessments validate that the images generated from our attack prompts present considerable safety hazards.
Computer Vision and Pattern Recognition, Cryptography and Security
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses security vulnerabilities in current state-of-the-art text-to-image generation models (such as Midjourney, DALL·E 2, and Stable Diffusion). These models can generate highly realistic images, which raises concerns about the proliferation of unsafe content (such as adult, violent, or deceptive political images). Although the providers of these models claim to implement strict safety mechanisms that limit the generation of not-safe-for-work (NSFW) content, the researchers designed and demonstrated the first prompt attack against Midjourney, producing a large number of realistic NSFW images.

Specifically, the paper reveals the fundamental principle behind this prompt attack: closed-source safety measures can be circumvented by replacing the high-risk parts of a suspicious prompt. The researchers propose a new framework, SurrogatePrompt, which systematically generates attack prompts, using large language models together with image-to-text and image-to-image modules to automate attack prompt creation at scale. Evaluation results show that their attack prompts bypassed Midjourney's proprietary safety filter with an 88% success rate, yielding fake images of political figures in violent scenes. The authors also demonstrate methods for generating explicit adult-themed images.

### Main Contributions

1. **Revealing Vulnerabilities**: The paper explains how attackers can bypass the safety control mechanisms of state-of-the-art text-to-image models.
2. **System Framework**: The authors develop a framework for generating adversarial prompts and NSFW images, built on the fundamental principle of "replacement." The framework includes two automated prompt generation strategies and a technique specifically designed to increase the quantity of NSFW content produced.
3. **Efficient Attacks**: Building on key observations, the attack methods bypass Midjourney's safety filters and generate unsafe images with a high success rate. Specifically, they achieve an 88% bypass rate for prompts generating politically related violent scenes and a 54.3% bypass rate for prompts generating bloody scenes involving political figures.

### Related Work

1. **Security of Text-to-Image Models**: Existing research indicates that although these models have powerful generative capabilities, they also risk generating unsafe images. Some studies have exposed these models' security vulnerabilities through methods such as reverse engineering and reinforcement learning.
2. **Adversarial Sample Generation**: Adversarial sample generation for text-to-image models is still a relatively new field. Researchers have designed various methods that generate adversarial samples by exploiting hidden vocabulary and language features. These methods perform well on CLIP-based models but may have limited effectiveness on non-CLIP-based models.

### Problem Definition

The paper details the typical usage scenarios and the security threat model for online text-to-image services. Service providers usually implement safety controls to prevent the generation of unsafe images, but attackers may exploit weaknesses in these controls by designing malicious prompts that bypass the safety filters, then distribute the resulting NSFW content on social media platforms to serve their harmful intentions.

### Attack Methods

The core idea of the paper is to exploit the capability imbalance between safety filters and image synthesis models: replacing the sensitive parts of a prompt lets it pass the filter while still producing unsafe content. Specific strategies include:

1. **Adult Content**: Replacing sensitive terms with phrases describing clothing that exposes body parts can bypass Midjourney's safety controls. A further method to increase the success rate is using Midjourney's "no" parameter to guide the model to exclude certain elements.
2. **Violent Content**: Replacing blood with visually similar substitutes evades the filter's detection while still generating images containing violent elements.
3. **False Political Content**: Describing representative behaviors of political figures can manipulate the model into generating fraudulent images involving those figures.

### Conclusion

This paper is the first to explore attack methods against Midjourney's safety control system, revealing its potential security vulnerabilities. The results are significant for improving the security of text-to-image models, especially widely used ones like Midjourney.