AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens

Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou

2024-06-06

Abstract: Jailbreak attacks in large language models (LLMs) entail inducing the models to generate content that breaches ethical and legal norms through the use of malicious prompts, posing a substantial threat to LLM security. Current strategies for jailbreak attack and defense often focus on optimizing locally within specific algorithmic frameworks, resulting in ineffective optimization and limited scalability. In this paper, we present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques, generalizing them to all possible attack surfaces. We employ directed acyclic graphs (DAGs) to position and analyze existing jailbreak attacks, defenses, and evaluation methodologies, and propose three comprehensive, automated, and logical frameworks. AutoAttack investigates dependencies in two lines of jailbreak optimization strategies: genetic algorithm (GA)-based attacks and adversarial-generation-based attacks. We then introduce an ensemble jailbreak attack to exploit these dependencies. AutoDefense offers a mixture-of-defenders approach by leveraging the dependency relationships in pre-generative and post-generative defense strategies. AutoEvaluation introduces a novel evaluation method that distinguishes hallucinations, which are often overlooked, from jailbreak attack and defense responses. Through extensive experiments, we demonstrate that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.

Cryptography and Security

What problem does this paper attempt to address?

The paper addresses jailbreak attacks on large language models (LLMs), in which malicious prompts induce a model to generate content that violates ethical and legal norms, threatening LLM security. Existing attack and defense strategies tend to optimize locally within a specific algorithmic framework, which makes the optimization ineffective and limits scalability. The paper instead analyzes the dependency relationships among jailbreak attack and defense techniques, using directed acyclic graphs (DAGs) to position and analyze existing attack, defense, and evaluation methods. The authors propose three automated frameworks: AutoAttack, AutoDefense, and AutoEvaluation. AutoAttack examines the dependencies in two lines of attack optimization, genetic algorithm (GA)-based and adversarial-generation-based, and builds an ensemble attack that exploits them. AutoDefense takes a mixture-of-defenders approach that leverages the dependency relationships in pre-generative and post-generative defense strategies. AutoEvaluation introduces an evaluation method that distinguishes hallucinations, which are often overlooked, from jailbreak attack and defense responses. According to the authors, extensive experiments show that the ensemble attack and defense framework significantly outperforms existing methods. The approach does not target specific types of jailbreak prompts but aims to improve LLM security overall. However, the paper emphasizes that AutoJailbreak is not the ultimate solution for attack and defense and should instead serve as a benchmark for new attack and defense approaches.
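To make the dependency view concrete, here is a minimal Python sketch of how the stages of a defense pipeline could be recorded as a DAG and queried. The stage names (`prompt_check`, `generation`, `response_check`, `final_verdict`) and the helper functions are hypothetical illustrations, not the nodes or code used in the paper; the sketch only shows the general idea of ordering components by their dependencies using the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical stages (assumed for illustration, not taken from the paper).
# Each key maps to the set of stages it depends on.
dependencies = {
    "prompt_check": set(),                               # pre-generative defense
    "generation": {"prompt_check"},                      # the LLM call itself
    "response_check": {"generation"},                    # post-generative defense
    "final_verdict": {"prompt_check", "response_check"},
}


def execution_order(deps):
    """Return the stages in an order that respects every dependency edge.

    TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


def upstream(deps, node):
    """Collect every stage that `node` depends on, directly or transitively."""
    seen = set()
    stack = [node]
    while stack:
        for parent in deps.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


order = execution_order(dependencies)
print(order)
print(sorted(upstream(dependencies, "final_verdict")))
```

Recording the stages this way makes it explicit which components a change can affect: anything returned by `upstream` for a stage must be settled before that stage can be tuned, which is the kind of question a purely local optimization does not ask.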

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

Distract Large Language Models for Automatic Jailbreak Attack

Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models

AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

Defending Jailbreak Prompts via In-Context Adversarial Game

Comprehensive Assessment of Jailbreak Attacks Against LLMs

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis

LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models