HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models

Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Bai, Yang Liu, Qing Guo
2024-12-15
Abstract: Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet these models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods, such as those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While several black-box jailbreak attacks have been explored, they achieve limited jailbreak performance on T2I models because of the difficulty of optimizing in discrete spaces. To address this, we propose HTS-Attack, a heuristic token search attack method. HTS-Attack begins with an initialization that removes sensitive tokens, followed by a heuristic search in which high-performing candidates are recombined and mutated. This process generates a new pool of candidates, and the optimal adversarial prompt is updated based on their effectiveness. By incorporating both optimal and suboptimal candidates, HTS-Attack avoids local optima and improves robustness in bypassing defenses. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.
Computer Vision and Pattern Recognition, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the security and reliability of text-to-image (T2I) models with respect to generating inappropriate or NSFW (Not-Safe-For-Work) content. Specifically, it studies how T2I models and their defense mechanisms can be bypassed to produce NSFW content, and it aims to promote more secure and reliable T2I systems by revealing these vulnerabilities.

#### Main problems:

1. **Risk of NSFW content generation**: Although existing T2I models have achieved remarkable success in image generation and editing, they can still generate inappropriate content (such as adult content, violence, or self-harm).
2. **Limitations of existing attack methods**: Most existing attack methods rely on gradient optimization and treat T2I models as white-box systems. In practice, attackers usually cannot access the model's gradient information, and modern defense mechanisms (such as gradient masking) make gradient-based attacks ineffective.
3. **Challenges of black-box attacks**: Previous research has explored black-box attacks, but these methods are hard to optimize in discrete spaces, which limits their performance and makes it difficult for them to bypass defense mechanisms.

#### Solutions:

To solve these problems, the paper proposes HTS-Attack, a heuristic token search attack method. It treats the T2I model and its defense mechanism as a black-box system and proceeds as follows:

1. **Initialization phase**: Remove sensitive tokens from the prompt, which reduces the number of queries and improves efficiency.
2. **Heuristic search**: Recombine and mutate high-performing candidates to generate a new candidate pool. Keeping both optimal and suboptimal candidates helps the search avoid local optima and makes the attack more robust.
3. **Experimental verification**: Verify the effectiveness of HTS-Attack through extensive experiments, including bypassing the latest prompt checkers, post-hoc image checkers, safety-trained T2I models, and online commercial models.

#### Formulas:

The formulas involved in the paper are:

- Objective function:
  \[ \begin{cases} F_\theta(p_{adv}) \neq 0 \\ \max \cos(T_\theta(p_{tar}), I_\theta(F_\theta(p_{adv}))) \end{cases} \]
- Set of sensitive words:
  \[ T_{NSFW} = \{p_k \mid \exists s_i \in S, s_i \subseteq p_k\} \]
- Updated set of sensitive words:
  \[ T_{NSFW} = T_{NSFW} \cup \{p_k \mid p_k \text{ is removed during iteration}\} \]
- Text similarity filtering:
  \[ P_T = \{p \mid \cos(T_\theta(p), T_\theta(p_{tar})) > \xi_t, p \in P_C\} \]
- Image semantic similarity calculation:
  \[ S_I = \frac{1}{K} \sum_{k = 1}^{K} \cos(I_\theta(F_\theta(p)), I_\theta(c_k)) + \cos(I_\theta(F_\theta(p)), T_\theta(p_{tar})) \]

With these components, HTS-Attack can bypass the defense mechanisms of T2I models without relying on gradient information and generate NSFW content with high semantic similarity to the target prompt. This reveals vulnerabilities in current T2I models and defense mechanisms and provides a reference for developing stronger defense strategies.
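To make the text similarity filter \(P_T\) and the image score \(S_I\) above more concrete, here is a minimal numeric sketch. It assumes, based only on how the symbols are used in the formulas, that \(T_\theta\) and \(I_\theta\) produce embedding vectors for text and images, and that \(c_k\) are \(K\) reference images. The function names (`cos`, `text_filter`, `image_score`) and the hand-written 2-D vectors are hypothetical stand-ins for illustration and do not come from the paper; no model is queried.

```python
import math


def cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_filter(candidate_embs, target_emb, xi_t):
    """P_T: keep candidates whose text embedding has cosine > xi_t with the target.

    candidate_embs maps a candidate label to its (assumed precomputed) text embedding.
    """
    return [name for name, emb in candidate_embs.items()
            if cos(emb, target_emb) > xi_t]


def image_score(image_emb, reference_embs, target_text_emb):
    """S_I: mean cosine to the K reference image embeddings,
    plus cosine between the image embedding and the target text embedding."""
    k = len(reference_embs)
    mean_ref = sum(cos(image_emb, c) for c in reference_embs) / k
    return mean_ref + cos(image_emb, target_text_emb)


# Toy 2-D vectors standing in for real embeddings (assumption for illustration).
target = [1.0, 0.0]
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

kept = text_filter(candidates, target, xi_t=0.5)
score = image_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target)

print(kept)   # candidates with cosine above the threshold
print(score)  # mean reference similarity plus target similarity
```

In this toy setup, candidate `b` is orthogonal to the target and is filtered out, while `a` and `c` pass the threshold. The score adds the average similarity to the two reference vectors (0.5) to the similarity with the target (1.0).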