Abstract: In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper examines the safety of Text-to-Image (T2I) models with respect to inappropriate or unsafe content (such as adult content, violence, and politically sensitive material). Although current T2I models integrate safety checkers to prevent the generation of such images, these checkers can still be bypassed.

### Specific Problem Description

1. **Effectiveness of Safety Checkers**: Existing T2I models use safety checkers to filter unsafe user prompts, but these checkers can be bypassed, leading to the generation of unsafe images.
2. **Limitations of Existing Attack Methods**:
   - Some methods rely on white-box adversarial attacks against specific T2I models, and the attack prompts they generate contain meaningless, hard-to-understand vocabulary, which reduces the stealthiness of the attack.
   - Other methods require complex pipelines and extensive queries to the T2I model, resulting in high time and resource consumption.

### Proposed Method

To address these limitations, the authors propose a Perception-Guided Jailbreak (PGJ) method. Its main features are as follows:

1. **Model Agnostic**: PGJ does not require a specific T2I model as the target.
2. **Naturalness**: The generated attack prompts are highly natural (stealthy) and do not contain meaningless vocabulary.
3. **Efficiency**: The method automatically and efficiently finds safe alternative phrases that satisfy perceptual similarity and textual semantic inconsistency.

### Key Concepts

- **Perceptual Confusion**: Because of perceptual similarity, people may confuse objects or behaviors in images. For example, flour in an image might look like heroin.
- **Perceptual Similarity and Textual Semantic Inconsistency Principle (PSTSI Principle)**: A safe alternative phrase should be perceptually similar to the target unsafe word but semantically inconsistent with it as text.

### Method Workflow

1. **Unsafe Word Selection**: Use a large language model (LLM) to automatically detect unsafe words in the user prompt.
2. **Word Replacement**: Find a safe alternative phrase that meets the PSTSI principle and substitute it for the unsafe word to generate the attack prompt.

### Experimental Results

- **Attack Success Rate (ASR)**: PGJ shows a higher attack success rate across various T2I models.
- **Semantic Consistency (SC)**: The generated images are semantically consistent with the original unsafe user prompts.
- **Prompt Perplexity (PPL)**: The generated attack prompts are natural, with a low PPL value.
- **Time Efficiency**: PGJ generates attack prompts quickly, significantly faster than other methods.

### Conclusion

By proposing the Perception-Guided Jailbreak (PGJ) method, this paper shows that the safety checkers of existing T2I models can be bypassed. The method combines a high attack success rate with natural, hard-to-detect attack prompts and high efficiency.
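
The ASR and PPL metrics listed under Experimental Results can be illustrated with a minimal sketch. This summary does not give the paper's exact definitions, so the code assumes ASR is the fraction of attack prompts counted as successful and PPL is the standard exponentiated mean negative log-likelihood per token. The function names and input formats are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of prompts marked successful (assumed definition of ASR)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative natural-log probability.

    token_log_probs holds one log-probability per token, as scored by
    some language model (which model is used is not stated in this summary).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Under these assumptions, a prompt whose tokens each receive probability 0.5 has a perplexity of 2, and lower values indicate text that the scoring model finds more natural.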

Perception-guided Jailbreak against Text-to-Image Models

Multimodal Pragmatic Jailbreak on Text-to-image Models

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models

Jailbreaking Text-to-Image Models with LLM-Based Agents

IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves

Automatic Jailbreaking of the Text-to-Image Generative AI Systems

IDEATOR: Jailbreaking VLMs Using VLMs

Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

Jailbreaking Attack against Multimodal Large Language Model

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Antelope: Potent and Concealed Jailbreak Attack Strategy

PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization

Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models

Efficient LLM-Jailbreaking by Introducing Visual Modality

Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models