Abstract: Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, in which algorithmically crafted adversarial suffixes appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.
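The ASR percentages quoted in the abstract are easiest to read as a fraction of queries. The sketch below is one common way to compute such a rate, assuming each query gets one or more attack attempts and an external judge has already labeled each attempt as success or failure. The abstract does not spell out the paper's exact ASR definition or judge, so the function name `attack_success_rate`, the "any attempt succeeds" rule, and the sample data are all illustrative assumptions.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[Sequence[bool]]) -> float:
    """Fraction of queries for which at least one attempt was judged successful.

    `outcomes[i]` holds the judged results for query i (assumed labeling scheme).
    """
    if not outcomes:
        raise ValueError("need at least one query")
    successes = sum(1 for attempts in outcomes if any(attempts))
    return successes / len(outcomes)


# Hypothetical judged outcomes: 4 queries with up to 3 attempts each.
judged = [
    [False, True, False],   # succeeded on the second attempt
    [False, False, False],  # never succeeded
    [True],                 # succeeded on the first attempt
    [False, True],          # succeeded on the second attempt
]
print(f"ASR = {attack_success_rate(judged):.0%}")  # prints "ASR = 75%"
```

Under this reading, a rate depends on how many attempts are allowed per query and on how strict the judge is, so figures from different papers are only comparable when both are held fixed.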

What problem does this paper attempt to address?

The paper attempts to address the issue of large language models (LLMs) being susceptible to automated jailbreak attacks. Specifically, current methods face high computational costs and low attack success rates when generating adversarial suffixes, especially for well-aligned models like Llama2 and Llama3. To overcome these limitations, the authors propose ADV-LLM, an iterative self-tuning process capable of generating adversarial LLMs with stronger jailbreak capabilities.

### Main Issues:

1. **High computational cost**: Existing methods have high computational costs for generating adversarial suffixes.
2. **Low attack success rate**: Particularly for well-aligned models (such as Llama2 and Llama3), existing methods have a low attack success rate.
3. **Poor generalization**: Existing methods perform poorly when dealing with unseen harmful queries.
4. **Insufficient stealth**: The generated adversarial suffixes are often easily detectable due to their high perplexity.

### Solutions:

- **Iterative self-tuning algorithm**: Gradually transforms a pre-trained LLM into ADV-LLMs through self-generated data.
- **Efficient suffix generation**: Trained ADV-LLMs can generate a large number of adversarial suffixes within seconds, significantly reducing computational costs.
- **High attack success rate**: ADV-LLM achieves a high attack success rate on both open-source and closed-source LLMs.
- **Strong generalization**: ADV-LLM performs well on unseen harmful queries, indicating its effectiveness across various user-designed queries.
- **High stealth**: The generated adversarial suffixes have low perplexity, making them difficult to detect.

### Experimental Results:

- **Attack success rate**: ADV-LLM's attack success rate is significantly higher than other methods across multiple models, especially on GPT-3.5 and GPT-4.
- **Generalization**: ADV-LLM performs excellently on unseen harmful queries, demonstrating strong generalization capabilities.
- **Stealth**: The generated adversarial suffixes have low perplexity, making them difficult to detect.

In summary, the paper effectively addresses the issues of high computational cost, low attack success rate, poor generalization, and insufficient stealth in generating adversarial suffixes by proposing the ADV-LLM framework.
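The stealth points above rest on perplexity, which is the exponential of the average negative log-likelihood a language model assigns to the tokens of a text. The sketch below shows that formula and a simple threshold check of the kind a perplexity-based detector might apply. It is an illustration under stated assumptions, not the paper's evaluation code: the per-token log-probabilities are made-up inputs that would normally come from a scoring model, and the names `perplexity`, `flagged_by_perplexity_filter`, and the threshold of 100 are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flagged_by_perplexity_filter(token_logprobs: Sequence[float],
                                 threshold: float) -> bool:
    """True when the text looks unnatural to the scoring model (assumed rule)."""
    return perplexity(token_logprobs) > threshold


# Made-up log-probabilities: a fluent suffix vs. a garbled token string.
fluent = [math.log(0.5)] * 4     # perplexity of about 2
garbled = [math.log(0.001)] * 4  # perplexity of about 1000

THRESHOLD = 100.0  # hypothetical cut-off; a real filter would tune this value
print(flagged_by_perplexity_filter(fluent, THRESHOLD))   # False
print(flagged_by_perplexity_filter(garbled, THRESHOLD))  # True
```

This is why low-perplexity suffixes are harder to detect: text that reads as natural language scores below any threshold that still lets ordinary user prompts through.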

Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings

Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs

LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

Weak-to-Strong Jailbreaking on Large Language Models

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Jailbreaking Black Box Large Language Models in Twenty Queries

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer

Playing Language Game with LLMs Leads to Jailbreaking