AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens

Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou
2024-06-06
Abstract: Jailbreak attacks in large language models (LLMs) entail inducing the models to generate content that breaches ethical and legal norms through the use of malicious prompts, posing a substantial threat to LLM security. Current strategies for jailbreak attack and defense often focus on optimizing locally within specific algorithmic frameworks, resulting in ineffective optimization and limited scalability. In this paper, we present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques, generalizing them to all possible attack surfaces. We employ directed acyclic graphs (DAGs) to position and analyze existing jailbreak attacks, defenses, and evaluation methodologies, and propose three comprehensive, automated, and logical frameworks. AutoAttack investigates dependencies in two lines of jailbreak optimization strategies: genetic algorithm (GA)-based attacks and adversarial-generation-based attacks. We then introduce an ensemble jailbreak attack to exploit these dependencies. AutoDefense offers a mixture-of-defenders approach by leveraging the dependency relationships in pre-generative and post-generative defense strategies. AutoEvaluation introduces a novel evaluation method that distinguishes hallucinations, which are often overlooked, from jailbreak attack and defense responses. Through extensive experiments, we demonstrate that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
Cryptography and Security
What problem does this paper attempt to address?
The paper addresses jailbreak attacks on large language models (LLMs), in which malicious prompts induce a model to generate content that breaches ethical and legal norms, threatening LLM security. Current attack and defense strategies tend to optimize locally within a specific algorithmic framework, which leads to ineffective optimization and limited scalability. The paper instead analyzes jailbreak attack and defense techniques systematically from a dependency perspective, using directed acyclic graphs (DAGs) to position and analyze existing attack, defense, and evaluation methods. The authors propose three automated frameworks: AutoAttack, AutoDefense, and AutoEvaluation. AutoAttack investigates dependencies in two lines of jailbreak optimization, genetic algorithm (GA)-based attacks and adversarial-generation-based attacks, and combines them into an ensemble attack. AutoDefense adopts a mixture-of-defenders approach that leverages the dependency relationships in pre-generative and post-generative defense strategies. AutoEvaluation introduces an evaluation method that distinguishes hallucinations, which are often overlooked, from jailbreak attack and defense responses. Through extensive experiments, the paper reports that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing methods. It does not target specific types of jailbreak prompts but aims to enhance the overall security of LLMs. The paper also emphasizes that AutoJailbreak is not the ultimate solution for attack and defense and should instead serve as a benchmark for new attack and defense approaches.
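To make the dependency perspective concrete, the following is a minimal sketch of how components of a defense pipeline could be arranged in a DAG and ordered so that each one runs after the components it depends on. The node names (input_filter, output_check, and so on) and the helper execution_order are hypothetical illustrations, not the DAGs or the implementation used in the paper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical component names chosen for illustration only.
# Each key maps to the set of components it depends on.
defense_dependencies = {
    "input_filter": set(),                      # pre-generative step
    "prompt_rewrite": {"input_filter"},         # pre-generative step
    "generation": {"prompt_rewrite"},           # the model call itself
    "output_check": {"generation"},             # post-generative step
    "hallucination_check": {"generation"},      # post-generative step
    "final_verdict": {"output_check", "hallucination_check"},
}


def execution_order(dependencies):
    """Return one valid order in which every component follows its dependencies.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(defense_dependencies)
print(order)
```

Several valid orders can exist for the same graph (here the two post-generative checks may appear in either order), so the sketch only guarantees that dependencies come first, not one specific sequence.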