Perception-guided Jailbreak against Text-to-Image Models

Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu
2024-08-26
Abstract: In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper examines the safety of Text-to-Image (T2I) models with respect to inappropriate or unsafe content (such as adult content, violence, and politically sensitive material). Although current T2I models integrate safety checkers to prevent the generation of such images, these checkers can still be bypassed.

### Specific Problem Description

1. **Effectiveness of Safety Checkers**: Existing T2I models use safety checkers to filter unsafe user prompts, but these checkers can be bypassed, leading to the generation of unsafe images.
2. **Limitations of Existing Attack Methods**:
   - Some methods rely on white-box adversarial attacks against specific T2I models. The attack prompts they generate contain meaningless, hard-to-understand vocabulary, which reduces the stealthiness of the attack.
   - Other methods require complex pipelines and extensive queries to the T2I model, resulting in high time and resource consumption.

### Proposed Method

To overcome these limitations, the authors propose a Perception-Guided Jailbreak (PGJ) method. Its main features are as follows:

1. **Model Agnostic**: PGJ is a black-box method that does not require a specific T2I model as its target (the abstract calls this "model-free").
2. **Naturalness**: The generated attack prompts are highly natural (stealthy) and do not contain meaningless vocabulary.
3. **Efficiency**: The method automatically and efficiently finds safe alternative phrases that satisfy perceptual similarity and textual semantic inconsistency.

### Key Concepts

- **Perceptual Confusion**: Because of perceptual similarity, people may confuse objects or behaviors in images. For example, flour in an image might look like heroin.
- **Perceptual Similarity and Textual Semantic Inconsistency Principle (PSTSI Principle)**: A safe alternative phrase should be perceptually similar to the target unsafe word but semantically inconsistent with it as text.

### Method Workflow

1. **Unsafe Word Selection**: Use a large language model (LLM) to automatically detect unsafe words in the user prompt.
2. **Word Replacement**: Find a safe alternative phrase that meets the PSTSI principle and substitute it for the unsafe word to produce the attack prompt.

### Experimental Results

- **Attack Success Rate (ASR)**: PGJ achieves a high attack success rate across the open-source models and commercial online services evaluated.
- **Semantic Consistency (SC)**: The generated images are semantically consistent with the original unsafe user prompts.
- **Prompt Perplexity (PPL)**: The generated attack prompts are natural, as indicated by a low PPL value.
- **Time Efficiency**: PGJ produces attack prompts quickly, significantly faster than the other methods it is compared with.

### Conclusion

By proposing the Perception-Guided Jailbreak (PGJ) method, this paper shows that the safety checkers of existing T2I models can be bypassed. The method combines a high attack success rate with natural, hard-to-detect attack prompts and high efficiency.
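The summary does not define how ASR and PPL are computed. As a rough illustration only, the sketch below uses the common definitions: ASR as the fraction of prompts that pass the safety checker, and PPL as the exponential of the negative mean per-token log-probability. The function names and the sample values are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def attack_success_rate(bypassed: Sequence[bool]) -> float:
    """Fraction of prompts whose outcome is recorded as a bypass (assumed definition)."""
    if not bypassed:
        raise ValueError("need at least one outcome")
    return sum(bypassed) / len(bypassed)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical outcomes for four prompts: three bypassed the checker.
outcomes = [True, True, False, True]
asr = attack_success_rate(outcomes)  # 0.75

# Hypothetical per-token log-probabilities from some language model.
# A fluent prompt gets likely tokens; a garbled one gets unlikely tokens.
fluent_ppl = perplexity([math.log(0.5)] * 4)
garbled_ppl = perplexity([math.log(0.01)] * 4)

print(f"ASR: {asr:.2f}")
print(f"fluent PPL: {fluent_ppl:.1f}, garbled PPL: {garbled_ppl:.1f}")
```

Under these definitions a lower PPL means the language model finds the prompt more predictable, which is why the summary reads a low PPL as a sign of natural wording.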