Abstract: Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet these models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods, such as those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While several black-box jailbreak attacks have been explored, they achieve limited performance in jailbreaking T2I models due to the difficulty of optimizing in discrete spaces. To address this, we propose HTS-Attack, a heuristic token search attack method. HTS-Attack begins with an initialization that removes sensitive tokens, followed by a heuristic search where high-performing candidates are recombined and mutated. This process generates a new pool of candidates, and the optimal adversarial prompt is updated based on their effectiveness. By incorporating both optimal and suboptimal candidates, HTS-Attack avoids local optima and improves robustness in bypassing defenses. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.
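The recombine-and-mutate loop described in the abstract follows the general shape of a genetic-style search over token sequences. The sketch below is a generic illustration of that shape only, not the authors' implementation: the function names (`crossover`, `mutate`, `search`), the pool sizes, the mutation rate, and the toy scoring function (position-wise agreement with a fixed word list) are all assumptions made for the example, and no model or checker is involved.

```python
import random

def crossover(a, b, rng):
    """Single-point recombination of two token lists (needs length >= 2)."""
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def mutate(tokens, vocab, rng, rate=0.2):
    """Replace each token with a random vocabulary token with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]

def search(score, vocab, length=4, pool_size=12, keep=4, steps=30, seed=0):
    """Generic heuristic search: keep the top `keep` candidates (the best one
    and several suboptimal ones), then refill the pool with mutated children."""
    rng = random.Random(seed)
    pool = [[rng.choice(vocab) for _ in range(length)] for _ in range(pool_size)]
    best = max(pool, key=score)
    for _ in range(steps):
        ranked = sorted(pool, key=score, reverse=True)
        parents = ranked[:keep]  # optimal and suboptimal candidates together
        children = []
        while len(children) < pool_size - keep:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), vocab, rng))
        pool = parents + children
        top = max(pool, key=score)
        if score(top) > score(best):  # update the incumbent only on improvement
            best = top
    return best

# Toy objective (an assumption for this example): count positions that
# agree with a fixed benign word list.
VOCAB = ["red", "blue", "cat", "dog", "sun", "moon", "tree", "lake"]
TARGET = ["blue", "cat", "moon", "lake"]

def toy_score(candidate):
    return sum(x == y for x, y in zip(candidate, TARGET))

best = search(toy_score, VOCAB)
print(best, toy_score(best))
```

Keeping several parents rather than only the single best candidate is what lets recombination draw on suboptimal candidates, which is the property the abstract credits with avoiding local optima.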

### What problems does this paper attempt to solve?

This paper addresses the security and reliability of text-to-image (T2I) models with respect to generating inappropriate or Not-Safe-For-Work (NSFW) content. Specifically, it studies how T2I models and their defense mechanisms can be bypassed to generate NSFW content, and it aims to promote the development of more secure and reliable T2I systems by revealing these vulnerabilities.

#### Main problems:

1. **Risk of NSFW content generation**: Although existing T2I models have achieved remarkable success in image generation and editing, they can still generate inappropriate content (such as adult content, violence, and self-harm).
2. **Limitations of existing attack methods**: Most existing attack methods rely on gradient optimization and treat T2I models as white-box systems. In practice, attackers usually cannot access the model's gradients, and defenses such as gradient masking are designed to deny attackers accurate gradient information, which undermines gradient-based attacks.
3. **Challenges of black-box attacks**: Although previous research has explored black-box attacks, optimization in discrete spaces is difficult, so these methods achieve limited performance and struggle to bypass defense mechanisms.

#### Solutions:

To solve the above problems, the paper proposes a new heuristic token search attack method, HTS-Attack. The method treats the T2I model and its defense mechanism as a black-box system and proceeds through the following steps:

1. **Initialization phase**: Remove sensitive words to reduce the number of queries and improve efficiency.
2. **Heuristic search**: Recombine and mutate high-performing candidates, retaining both optimal and suboptimal ones, to avoid local optima and make the attack more robust.
3. **Experimental verification**: Verify the effectiveness of HTS-Attack through extensive experiments against the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.

#### Formulas:

The main formulas in the paper are:

- Objective function:
  \[ \begin{cases} F_\theta(p_{adv}) \neq 0 \\ \max \cos\big(T_\theta(p_{tar}),\, I_\theta(F_\theta(p_{adv}))\big) \end{cases} \]
- Set of sensitive words:
  \[ T_{NSFW} = \{\, p_k \mid \exists\, s_i \in S,\ s_i \subseteq p_k \,\} \]
- Updated set of sensitive words:
  \[ T_{NSFW} = T_{NSFW} \cup \{\, p_k \mid p_k \text{ is removed during iteration} \,\} \]
- Text similarity filtering:
  \[ P_T = \{\, p \mid \cos\big(T_\theta(p),\, T_\theta(p_{tar})\big) > \xi_t,\ p \in P_C \,\} \]
- Image semantic similarity calculation:
  \[ S_I = \frac{1}{K} \sum_{k=1}^{K} \cos\big(I_\theta(F_\theta(p)),\, I_\theta(c_k)\big) + \cos\big(I_\theta(F_\theta(p)),\, T_\theta(p_{tar})\big) \]

With these components, HTS-Attack can bypass the defense mechanisms of T2I models without relying on gradient information and generate NSFW content with high semantic similarity to the target prompt. This reveals vulnerabilities in current T2I models and defense mechanisms and provides a reference for developing stronger defense strategies.
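The text similarity filter \(P_T\) and the image similarity score \(S_I\) above are both built from cosine similarity, so they can be checked by hand on small vectors. The following is a minimal sketch under stated assumptions: the embeddings are precomputed toy vectors standing in for the outputs of \(T_\theta\) and \(I_\theta\), the function names are hypothetical, and \(S_I\) is read as the mean over the \(K\) reference images plus a single text-alignment term outside the sum.

```python
import math

def cos(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def text_filter(candidate_embs, target_emb, xi_t):
    """P_T: names of candidates whose text embedding has cosine
    similarity strictly greater than xi_t with the target embedding."""
    return [name for name, emb in candidate_embs.items()
            if cos(emb, target_emb) > xi_t]

def image_score(image_emb, reference_embs, target_text_emb):
    """S_I: mean cosine similarity to the K reference image embeddings,
    plus cosine similarity to the target text embedding (assumed reading)."""
    mean_ref = sum(cos(image_emb, c) for c in reference_embs) / len(reference_embs)
    return mean_ref + cos(image_emb, target_text_emb)

# Toy 2-D embeddings (assumptions for illustration only).
target = [1.0, 0.0]
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

print(text_filter(candidates, target, xi_t=0.5))                    # ['a', 'c']
print(image_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target))    # 1.5
```

In the first call, candidate `b` is orthogonal to the target (cosine 0) and is dropped, while `c` has cosine about 0.707 and passes the threshold. In the second, the mean over the two references is 0.5 and the text term is 1, giving 1.5.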

HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models
