Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
2024-10-26
Abstract: Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the issue of large language models (LLMs) being susceptible to automated jailbreak attacks. Specifically, current methods face high computational costs and low attack success rates when generating adversarial suffixes, especially for well-aligned models like Llama2 and Llama3. To overcome these limitations, the authors propose ADV-LLM, an iterative self-tuning process capable of generating adversarial LLMs with stronger jailbreak capabilities.

### Main Issues:

1. **High computational cost**: Existing methods have high computational costs for generating adversarial suffixes.
2. **Low attack success rate**: Particularly for well-aligned models (such as Llama2 and Llama3), existing methods have a low attack success rate.
3. **Poor generalization**: Existing methods perform poorly when dealing with unseen harmful queries.
4. **Insufficient stealth**: The generated adversarial suffixes are often easily detectable due to their high perplexity.

### Solutions:

- **Iterative self-tuning algorithm**: Gradually transforms a pre-trained LLM into ADV-LLMs through self-generated data.
- **Efficient suffix generation**: Trained ADV-LLMs can generate a large number of adversarial suffixes within seconds, significantly reducing computational costs.
- **High attack success rate**: ADV-LLM achieves a high attack success rate on both open-source and closed-source LLMs.
- **Strong generalization**: ADV-LLM performs well on unseen harmful queries, indicating its effectiveness across various user-designed queries.
- **High stealth**: The generated adversarial suffixes have low perplexity, making them difficult to detect.

### Experimental Results:

- **Attack success rate**: ADV-LLM's attack success rate is significantly higher than other methods across multiple models, especially on GPT-3.5 and GPT-4.
- **Generalization**: ADV-LLM performs excellently on unseen harmful queries, demonstrating strong generalization capabilities.
- **Stealth**: The generated adversarial suffixes have low perplexity, making them difficult to detect.

In summary, the paper effectively addresses the issues of high computational cost, low attack success rate, poor generalization, and insufficient stealth in generating adversarial suffixes by proposing the ADV-LLM framework.
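
The stealth points above rest on perplexity, which a defender can use as a simple filter: text the language model finds improbable scores high and gets flagged. As a rough illustration only, the sketch below computes perplexity from per-token natural-log probabilities using the standard definition, exp of the negative mean log-probability. The function names, the example log-probabilities, and the threshold value are all hypothetical and do not come from the paper; a real filter would obtain log-probabilities from a reference model and tune the threshold on benign traffic.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence given per-token natural-log probabilities.

    Standard definition: exp(-(1/N) * sum(log p_i)).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


def is_flagged(token_logprobs, threshold):
    """Hypothetical perplexity filter: flag input whose perplexity exceeds threshold."""
    return perplexity(token_logprobs) > threshold


# Illustrative numbers only (assumed, not measured on any model).
fluent_text = [math.log(0.5)] * 4      # each token fairly likely
garbled_text = [math.log(0.001)] * 4   # each token very unlikely

ASSUMED_THRESHOLD = 100.0  # hypothetical cut-off, not from the paper

fluent_flagged = is_flagged(fluent_text, ASSUMED_THRESHOLD)
garbled_flagged = is_flagged(garbled_text, ASSUMED_THRESHOLD)
```

Under these assumed numbers the fluent sequence stays below the cut-off while the garbled one exceeds it, which is why low-perplexity text is harder for this kind of filter to catch.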