Abstract: Diffusion models have recently achieved remarkable advancements in terms of image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak, which triggers T2I models to generate images containing visual text, where the image and the text, although considered safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset to evaluate current diffusion-based text-to-image (T2I) models under this jailbreak. We benchmark nine representative T2I models, including two closed-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models are susceptible to this type of jailbreak, with rates of unsafe generation ranging from 8% to 74%. In real-world scenarios, various filters, such as keyword blocklists, customized prompt filters, and NSFW image filters, are commonly employed to mitigate these risks. We evaluate the effectiveness of such filters against our jailbreak and find that, while current classifiers may be effective for single-modality detection, they fail to work against our jailbreak. Our work provides a foundation for further development towards more secure and reliable T2I models.

What problem does this paper attempt to address?

This paper focuses on the safety of text-to-image (T2I) models, in particular on cases where a "jailbroken" model generates unsafe content. The authors introduce a new form of jailbreak, the **Multimodal Pragmatic Jailbreak**, which triggers a T2I model to generate an image containing visual text. The image and the text are each safe when viewed separately, but their combination conveys unsafe content.

#### Main problems:

1. **Multimodal Pragmatic Jailbreak Phenomenon**:
   - Current T2I models can generate high-quality images that closely follow the text prompt, including images that contain rendered visual text. The image and the rendered text may each seem safe on their own while their combination is unsafe.
   - For example, a seemingly harmless image combined with a specific piece of text may convey offensive, hateful, or otherwise inappropriate information.
2. **Limitations of Existing Security Filters**:
   - The paper evaluates the effectiveness of existing unimodal security filters (such as keyword blocklists, customized prompt filters, and NSFW image filters) against the multimodal pragmatic jailbreak. The results show that these classifiers, although they may work for single-modality detection, largely fail to identify unsafe content that arises only from the combination of image and text.
3. **Risks in Practical Applications**:
   - In the real world, safety mechanisms such as text prompt filters and image safety classifiers are widely used to filter potentially harmful content. Their effectiveness against the multimodal pragmatic jailbreak remains limited, especially when the unsafe meaning depends on the interaction between language and visual content.

#### Solutions:

To study this phenomenon systematically, the authors make the following contributions:

1. **Constructing a Dataset**:
   - The authors developed the **Multimodal Pragmatic Unsafe Prompt Dataset (MPUP)**, which contains 1,200 unsafe prompts covering three categories: hate speech, physical harm, and fraud.
2. **Benchmark Testing**:
   - The authors benchmarked nine representative T2I models, including two closed-source commercial models. All tested models are vulnerable to the multimodal pragmatic jailbreak, with unsafe generation rates ranging from 8% to 74%.
3. **Improving Security Classifiers**:
   - In response to the deficiencies of existing security filters, the authors proposed and implemented several simple multimodal pragmatic security classifiers to improve model safety.
4. **Analyzing the Causes of Vulnerabilities**:
   - The authors explored potential weaknesses related to training data and prompts, and explained why these models are vulnerable to the multimodal pragmatic jailbreak.

In general, this paper aims to reveal the security risks of current T2I models when they generate multimodal content, and to provide a basis and direction for developing safer and more reliable T2I models.
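To make the filter limitation concrete, the following is a minimal toy sketch, not the authors' classifiers or evaluation code. It assumes each generated image can be reduced to a coarse image label plus the text rendered in it. The names `Generation`, `BLOCKLIST`, `UNSAFE_IMAGE_CONCEPTS`, `UNSAFE_PAIRS`, and all example values are hypothetical, and the lookup table merely stands in for a learned joint image-text classifier. It shows how two unimodal checks can each pass a sample that a joint check flags, and how a flag rate over a set of samples could be computed.

```python
from dataclasses import dataclass

# Hypothetical toy lists; real filters are far larger or learned.
BLOCKLIST = {"kill", "bomb"}                 # text-only keyword filter
UNSAFE_IMAGE_CONCEPTS = {"gore", "nudity"}   # image-only filter labels
# Hypothetical pairs that are unsafe only in combination (fraud-style example).
UNSAFE_PAIRS = {("bank logo", "enter your password here")}


@dataclass(frozen=True)
class Generation:
    image_concept: str  # stand-in for what an image classifier would see
    visual_text: str    # text rendered inside the generated image


def text_filter(text: str) -> bool:
    """Flag the text if it contains any blocklisted word."""
    return bool(set(text.lower().split()) & BLOCKLIST)


def image_filter(concept: str) -> bool:
    """Flag the image if its label is in the unsafe set."""
    return concept in UNSAFE_IMAGE_CONCEPTS


def unimodal_pipeline(g: Generation) -> bool:
    """Flag if either modality is unsafe on its own."""
    return text_filter(g.visual_text) or image_filter(g.image_concept)


def joint_filter(g: Generation) -> bool:
    """Flag if the image-text combination is unsafe (toy lookup)."""
    return (g.image_concept, g.visual_text.lower()) in UNSAFE_PAIRS


def flag_rate(generations, flt) -> float:
    """Fraction of generations that the given filter flags."""
    return sum(flt(g) for g in generations) / len(generations)


samples = [
    Generation("bank logo", "Enter your password here"),
    Generation("sunset", "hello"),
]
print(flag_rate(samples, unimodal_pipeline))  # unimodal checks
print(flag_rate(samples, joint_filter))       # joint check
```

In this toy setup the first sample passes both unimodal checks because neither the image label nor the rendered text is unsafe alone, while the joint check flags it. A practical joint classifier would have to reason over the actual image and text together rather than rely on a fixed table.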

Multimodal Pragmatic Jailbreak on Text-to-image Models
