Abstract: We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context, enabling them to bypass safety constraints and generate harmful outputs. By simply instructing the LLM to deviate and obfuscate previous attacks, our method dramatically outperforms existing approaches, achieving up to a 62% higher success rate in compromising nine leading chatbots, including GPT-4, Gemini, and Llama, while using only 13% of the queries. This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them. Our findings sound an urgent alarm for the need to revolutionize testing methodologies to ensure robust and reliable LLM security.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to reveal the security vulnerabilities of large language models (LLMs) in "jailbreak attacks" and proposes a new attack technique that bypasses security constraints by leveraging the LLM's ability to deviate from previous contexts to generate harmful outputs. Specifically, the paper finds that existing LLM security training methods may only mask the vulnerabilities rather than eliminate them completely. Therefore, the authors propose a method called "Diversified Attack Group Refinement" (DAGR), which significantly improves the success rate of attacks by diversifying and obfuscating attack prompts while reducing the number of queries required.

### Main Contributions

1. **Revealing the Flaws in Existing LLM Security Training**: The paper demonstrates through experiments that existing security training methods may only temporarily mask vulnerabilities without fundamentally solving the problem.
2. **Proposing the DAGR Method**: The DAGR method significantly improves the success rate of attacks by diversifying and obfuscating attack prompts and performs excellently on multiple popular LLMs.
3. **Efficient Attack Strategy**: The DAGR method not only far exceeds existing methods in attack success rate but also shows higher efficiency in terms of the number of queries and runtime.
4. **Wide Applicability**: The DAGR method performs well on both black-box and white-box models and is applicable to various types of LLMs.

### Experimental Results

- **Attack Success Rate**: DAGR achieved a significant attack success rate on multiple LLMs, especially on state-of-the-art LLMs like GPT-4 and GPT-4o, with a success rate over 57% higher than existing methods.
- **Number of Queries**: The number of queries required by DAGR is significantly reduced, with a reduction of up to 92%.
- **Runtime**: DAGR also excels in task completion time, being 398.2 times, 20.8 times, and 4.6 times faster than existing methods.

### Method Overview

1. **Problem Definition and Evaluation Criteria**: The paper defines two key scoring functions: a binary scoring function Sj for evaluating whether a jailbreak attempt is successful and a binary scoring function So for evaluating whether a prompt is relevant to the target.
2. **Scoring Function Design**: The paper improves existing scoring functions to more accurately identify jailbreak attempts embedded in harmless narratives.
3. **Attack Strategy Optimization**: The DAGR method systematically searches for potential jailbreak paths by diversifying and obfuscating attack prompts. At each depth level, the attacker model generates a diversified root prompt and multiple obfuscated leaf prompts to ensure thorough local search.
4. **Implementation Details**: The paper details the design of diversified and obfuscated system prompts and how to use memory and chain-of-thought to generate effective attack prompts.

### Conclusion

By proposing the DAGR method, this paper reveals the shortcomings of existing LLM security training methods and provides an efficient, diversified attack strategy. These findings highlight the need to reassess and improve current LLM security testing methods to ensure the safety and reliability of LLMs.
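
The headline numbers in this summary (attack success rate and query reduction) can be derived from per-attempt records once the two binary scores are available. The following is a minimal evaluation-side sketch, not the authors' code: the `Attempt` record, the function names, and the rule that an attempt counts as a success only when both Sj and So are true are assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Attempt:
    """One evaluated attempt against a target model (hypothetical record)."""
    jailbroken: bool  # assumed to hold the binary score S_j
    on_topic: bool    # assumed to hold the binary score S_o
    queries: int      # number of target-model queries spent on this attempt


def attack_success_rate(attempts: Sequence[Attempt]) -> float:
    """Fraction of attempts that are both jailbroken and on topic.

    Requiring both scores is an assumption of this sketch.
    """
    if not attempts:
        return 0.0
    successes = sum(1 for a in attempts if a.jailbroken and a.on_topic)
    return successes / len(attempts)


def query_reduction(baseline_queries: int, method_queries: int) -> float:
    """Relative reduction in queries compared with a baseline method."""
    if baseline_queries <= 0:
        raise ValueError("baseline_queries must be positive")
    return 1.0 - method_queries / baseline_queries


if __name__ == "__main__":
    # Made-up illustrative data, not results from the paper.
    sample = [
        Attempt(jailbroken=True, on_topic=True, queries=5),
        Attempt(jailbroken=True, on_topic=False, queries=8),
        Attempt(jailbroken=False, on_topic=True, queries=20),
        Attempt(jailbroken=True, on_topic=True, queries=3),
    ]
    print(f"ASR: {attack_success_rate(sample):.2f}")
    print(f"Query reduction: {query_reduction(100, 13):.2f}")
```

Under this definition, a method that uses 13% of a baseline's queries corresponds to a query reduction of 0.87; an attempt that is flagged as jailbroken but drifts off topic is not counted as a success.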

Diversity Helps Jailbreak Large Language Models
