Improved Generation of Adversarial Examples Against Safety-aligned LLMs

Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen
2024-11-01
Abstract: Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we adapt the ideas behind effective methods among these transfer-based attacks, i.e., Skip Gradient Method and Intermediate Level Attack, to gradient-based adversarial prompt generation and achieve significant performance gains without introducing obvious computational cost. Meanwhile, by discussing mechanisms behind the gains, new insights are drawn, and proper combinations of these methods are also developed. Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench. This match rate is 33% higher than that of a very strong baseline known as GCG, demonstrating advanced discrete optimization for adversarial prompt generation against LLMs. In addition, without introducing obvious cost, the combination achieves >30% absolute increase in attack success rates compared with GCG when generating both query-specific (38% -> 68%) and universal adversarial prompts (26.68% -> 60.32%) for attacking the Llama-2-7B-Chat model on AdvBench. Code at: https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the difficulty of generating adversarial examples against safety-aligned large language models (LLMs). Specifically, it focuses on how to generate effective adversarial prompts with gradient-based optimization so that jailbreak attacks on safety-aligned LLMs can be carried out automatically.

### Background and Motivation

1. **Generation of Adversarial Prompts**:
   - Adversarial prompts are carefully designed inputs that mislead the model into generating harmful content. They can be designed manually or generated automatically.
   - Designing adversarial prompts manually is labor-intensive, whereas automatically generated prompts are more threatening because they can be produced at scale.
2. **Limitations of Existing Methods**:
   - Gradient-based methods currently perform well at generating adversarial prompts, but because text is discrete, the input gradient does not precisely reflect the change in loss caused by replacing a token in the prompt.
   - As a result, attack success rates remain limited on some safety-aligned LLMs (such as the Llama-2-Chat models), even in the white-box setting.

### Solution

1. **Introducing the Concept of Transfer Attacks**:
   - The paper draws on transfer-based attacks developed for image classification models, in particular the Skip Gradient Method (SGM) and the Intermediate Level Attack (ILA).
   - These methods were originally proposed for black-box attacks; the paper adapts them to gradient-based adversarial prompt generation to improve attack performance.
2. **Improved Gradient Optimization Methods**:
   - By reducing the gradients that come from residual modules, the paper proposes Language SGM (LSGM), which narrows the gap between the input gradient and the actual effect of token replacements.
   - Following the idea of ILA, the paper analyzes the correlation between intermediate-layer representations and the adversarial loss, and uses it to further improve the gradient-based search for adversarial prompts.

### Experimental Results

- **Performance Improvement**:
  - With the combination of LSGM and ILA, 87% of the query-specific adversarial suffixes induce Llama-2-7B-Chat to produce output that exactly matches the target string on AdvBench. This match rate is 33% higher than that of the GCG baseline.
  - Attack success rates against Llama-2-7B-Chat on AdvBench rise from 38% to 68% for query-specific adversarial prompts and from 26.68% to 60.32% for universal adversarial prompts.
- **Computational Cost**:
  - These gains come without introducing obvious additional computational cost.

### Conclusion

By adapting ideas from transfer-based attacks, the paper improves gradient-based optimization for adversarial prompt generation. The resulting methods raise match rates and attack success rates over GCG in the reported experiments, and the analysis of why they help offers new insight into discrete optimization over text inputs to LLMs.
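
The idea of reducing the gradient that comes from residual modules can be illustrated on a toy model. The sketch below is not the authors' implementation: it assumes a stack of scalar residual blocks of the form `h + tanh(w * h)`, and the names `forward`, `input_gradient`, and `gamma` are hypothetical choices made for this illustration. It only shows how a decay factor applied to the residual branch changes the input gradient while the skip path is left untouched.

```python
import math


def forward(x, weights):
    """Run a stack of toy scalar residual blocks: h <- h + tanh(w * h)."""
    activations = [x]
    h = x
    for w in weights:
        h = h + math.tanh(w * h)
        activations.append(h)
    return h, activations


def input_gradient(x, weights, gamma=1.0):
    """Backpropagate d(output)/d(input) through the toy network.

    The gradient of each residual branch is multiplied by gamma, and the
    skip connection keeps its gradient of 1. With gamma = 1.0 this is the
    exact gradient; gamma < 1.0 down-weights the residual branches.
    (gamma is a hypothetical parameter name used for this sketch.)
    """
    _, activations = forward(x, weights)
    grad = 1.0
    # Walk the blocks from last to first, pairing each weight with the
    # activation that was fed into that block.
    for w, h in zip(reversed(weights), reversed(activations[:-1])):
        branch = w * (1.0 - math.tanh(w * h) ** 2)  # d tanh(w*h) / dh
        grad *= 1.0 + gamma * branch
    return grad


if __name__ == "__main__":
    weights = [0.5, -0.3, 0.8]
    x = 0.7
    for gamma in (1.0, 0.5, 0.0):
        print(f"gamma={gamma}: gradient={input_gradient(x, weights, gamma):.4f}")
```

Setting `gamma` to 1.0 recovers ordinary backpropagation, and setting it to 0.0 keeps only the skip path, so the gradient collapses to 1. In a real transformer the same effect would typically be obtained by rescaling gradients at the outputs of the attention and MLP sub-blocks during the backward pass; that wiring is model-specific and is not shown here.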