Antelope: Potent and Concealed Jailbreak Attack Strategy

Xin Zhao, Xiaojun Chen, Haoyu Gao
2024-12-11
Abstract: Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily encompass adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion of sensitive concepts with similar ones, facilitates searches in the semantically adjacent space of these related concepts, and aligns them with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. In addition, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.
Cryptography and Security, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses jailbreak attacks on generative models, in particular diffusion-based text-to-image (T2I) models. Such attacks bypass existing security filtering mechanisms and cause the model to generate inappropriate or harmful content, such as NSFW images. Although various security filters have been deployed, existing attack methods still suffer from low search efficiency, conspicuous attack characteristics, and poor alignment with the target.

The paper proposes a new attack strategy, Antelope, which aims to overcome these limitations in the following ways:

1. **Improve attack concealment and alignment**: Antelope confuses sensitive concepts with similar ones, searches the semantically adjacent space of those related concepts, and aligns them with the target image, thereby generating sensitive images that match the target and evade detection.
2. **Enhance attack efficiency**: Compared with other methods, Antelope reduces the total search time by optimizing the search process (for example, by using a list of candidate words, setting an optimal threshold, and stopping early).
3. **Verify the transferability of the attack**: Antelope exploits the transferability of model-based attacks to penetrate the security defenses of online black-box services.

### Main contributions of the paper

- **Design and implement an efficient jailbreak attack strategy**, Antelope, that searches for adversarial prompts able to bypass the security mechanisms of T2I models.
- **Compare with multiple attack methods on multiple defense baselines**, demonstrating the superior performance and robustness of Antelope.
- **Extensive evaluation and analysis** show that the adversarial prompts generated by Antelope have a low detection risk and high semantic alignment.

### Experimental results

The experimental results show that Antelope performs well under multiple defense mechanisms, especially in terms of attack success rate (ASR) and the quality of generated images (FID). In addition, tests of Antelope on online services (such as Midjourney and Leonardo.AI) also demonstrate its strong concealment and adaptability.

### Key formulas

- The adjustment formula for the adversarial text embedding:

  \[ E_t = E_c - E_n + E_p \]

  where \(E_t\) is the adjusted text embedding, \(E_c\) is the embedding of the clean prompt, \(E_n\) is the negative embedding, and \(E_p\) is the positive embedding.
- The text loss function:

  \[ L_{\text{txt}} = 1 - \cos(E_c || s, E_t) \]

- The image loss function:

  \[ L_{\text{img}} = 1 - \cos(E_c || s, E_i) \]

- The final optimization objective:

  \[ \min_s L = \gamma L_{\text{txt}} + (1 - \gamma) L_{\text{img}} \]

  where \(s\) is the variable being optimized and \(\gamma\) is a weighting factor that balances the text and image modality loss terms.

Through these methods, Antelope not only improves the attack success rate but also maintains close alignment between the generated image and the original intent, while evading various security check mechanisms.
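
The arithmetic of the key formulas can be illustrated with a small numerical sketch. It only evaluates the embedding adjustment and the weighted cosine losses on placeholder vectors. It involves no model, no prompts, and no search procedure, and it is not the authors' implementation. The function names (`cosine`, `adjusted_embedding`, `combined_loss`), the vector dimension, and the random inputs are all assumptions made for illustration. The argument `e_query` stands in for the first argument of the cosine terms, whose exact construction the summary does not specify.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def adjusted_embedding(e_c, e_n, e_p):
    """E_t = E_c - E_n + E_p (element-wise)."""
    return e_c - e_n + e_p


def combined_loss(e_query, e_t, e_i, gamma):
    """Return (L, L_txt, L_img) with L = gamma * L_txt + (1 - gamma) * L_img."""
    l_txt = 1.0 - cosine(e_query, e_t)  # text-side loss
    l_img = 1.0 - cosine(e_query, e_i)  # image-side loss
    total = gamma * l_txt + (1.0 - gamma) * l_img
    return total, l_txt, l_img


# Placeholder vectors only: random numbers standing in for embeddings.
rng = np.random.default_rng(0)
dim = 8  # hypothetical embedding size, chosen for readability
e_c, e_n, e_p, e_i, e_query = (rng.normal(size=dim) for _ in range(5))

e_t = adjusted_embedding(e_c, e_n, e_p)
loss, l_txt, l_img = combined_loss(e_query, e_t, e_i, gamma=0.5)
print(f"L_txt={l_txt:.3f}  L_img={l_img:.3f}  L={loss:.3f}")
```

Because cosine similarity lies in \([-1, 1]\), each loss term lies in \([0, 2]\). Setting \(\gamma = 1\) reduces the objective to the text term alone, and \(\gamma = 0\) reduces it to the image term alone.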