Abstract: Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily encompass adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion of sensitive concepts with similar ones, facilitates searches in the semantically adjacent space of these related concepts and aligns them with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. In addition, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.

What problem does this paper attempt to address?

The paper addresses jailbreak attacks on generative models, in particular diffusion-based text-to-image (T2I) models. Such attacks bypass existing security filtering mechanisms and cause the model to generate inappropriate or harmful content, such as NSFW images. Various security filtering measures have been implemented, and existing attack methods that target them still suffer from low search efficiency, conspicuous attack characteristics, and poor alignment with the target.

Specifically, the paper proposes a new attack strategy, Antelope, which aims to overcome these limitations in the following ways:

1. **Improve attack concealment and alignment**: Antelope confuses sensitive concepts with similar concepts, searches in the semantically adjacent space, and aligns these concepts with the target image, thereby generating sensitive images that match the target and can evade detection.
2. **Enhance attack efficiency**: Compared with other methods, Antelope reduces the total search time by optimizing the search process (for example, by using a list of candidate words, setting an optimal threshold, and stopping early).
3. **Verify the transferability of the attack**: Antelope exploits the transferability of model-based attacks to penetrate the security defenses of online black-box services.

### Main contributions of the paper

- **Design and implementation of an efficient jailbreak attack strategy**, Antelope, which explores adversarial prompts that can bypass the security mechanisms of T2I models.
- **Comparison with multiple attack methods on multiple defense baselines**, demonstrating the superior performance and robustness of Antelope.
- **Extensive evaluation and analysis**, showing that the adversarial prompts Antelope generates have a low detection risk and high semantic alignment.
### Experimental results

The experimental results show that Antelope performs well under multiple defense mechanisms, especially in terms of attack success rate (ASR) and the quality of generated images (FID). In addition, tests of Antelope on online services (such as Midjourney and Leonardo.AI) also demonstrate its strong concealment and adaptability.

### Key formulas

- The adjustment formula for the adversarial text embedding:

  \[ E_t = E_c - E_n + E_p \]

  where \(E_t\) is the adjusted text embedding, \(E_c\) is the embedding of the clean prompt, \(E_n\) is the negative embedding, and \(E_p\) is the positive embedding.

- The text loss function:

  \[ L_{\text{txt}} = 1 - \cos(E_c \,\|\, s,\; E_t) \]

- The image loss function:

  \[ L_{\text{img}} = 1 - \cos(E_c \,\|\, s,\; E_i) \]

- The final optimization objective:

  \[ \min_s L = \gamma L_{\text{txt}} + (1 - \gamma) L_{\text{img}} \]

  where \(\gamma\) is a weighting factor that balances the text and image modality loss terms.

Through these methods, Antelope not only improves the success rate of the attack but also ensures a high alignment between the generated image and the original intention, while effectively avoiding various security check mechanisms.
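
The arithmetic of the key formulas above can be checked with a small numerical sketch. It uses NumPy with random toy vectors in place of real encoder outputs and involves no model or prompt. The function names (`cosine`, `target_embedding`, `combined_loss`), the dimension, and the value of γ are illustrative assumptions rather than anything taken from the paper, and the vector `e_cs` merely stands in for the embedding written as E_c ‖ s.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def target_embedding(e_c: np.ndarray, e_n: np.ndarray, e_p: np.ndarray) -> np.ndarray:
    """E_t = E_c - E_n + E_p, computed element-wise."""
    return e_c - e_n + e_p


def combined_loss(e_cs, e_t, e_i, gamma):
    """Return (L, L_txt, L_img) with L = gamma * L_txt + (1 - gamma) * L_img."""
    l_txt = 1.0 - cosine(e_cs, e_t)  # text term: 1 - cos(E_c || s, E_t)
    l_img = 1.0 - cosine(e_cs, e_i)  # image term: 1 - cos(E_c || s, E_i)
    return gamma * l_txt + (1.0 - gamma) * l_img, l_txt, l_img


# Toy data only: random vectors stand in for every embedding (assumption).
rng = np.random.default_rng(0)
dim = 8
e_c, e_n, e_p, e_i, e_cs = (rng.normal(size=dim) for _ in range(5))

e_t = target_embedding(e_c, e_n, e_p)
loss, l_txt, l_img = combined_loss(e_cs, e_t, e_i, gamma=0.5)
print(f"L_txt={l_txt:.3f}  L_img={l_img:.3f}  L={loss:.3f}")
```

Each loss term lies in [0, 2] because cosine similarity lies in [-1, 1]. Setting γ = 1 reduces the objective to the text term alone, and γ = 0 to the image term alone.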

Antelope: Potent and Concealed Jailbreak Attack Strategy

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

Perception-guided Jailbreak against Text-to-Image Models

IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves

Multimodal Pragmatic Jailbreak on Text-to-image Models

You Know What I'm Saying: Jailbreak Attack via Implicit Reference

Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

Attack as Defense: Run-time Backdoor Implantation for Image Content Protection

IDEATOR: Jailbreaking VLMs Using VLMs

SneakyPrompt: Jailbreaking Text-to-image Generative Models

BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

Jailbreaking Text-to-Image Models with LLM-Based Agents

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring