Diversity Helps Jailbreak Large Language Models

Weiliang Zhao, Daniel Ben-Levi, Junfeng Yang, Chengzhi Mao
2024-11-07
Abstract: We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context, enabling them to bypass safety constraints and generate harmful outputs. By simply instructing the LLM to deviate and obfuscate previous attacks, our method dramatically outperforms existing approaches, achieving up to a 62% higher success rate in compromising nine leading chatbots, including GPT-4, Gemini, and Llama, while using only 13% of the queries. This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them. Our findings sound an urgent alarm for the need to revolutionize testing methodologies to ensure robust and reliable LLM security.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper exposes how vulnerable large language models (LLMs) are to jailbreak attacks. It proposes a new attack technique that bypasses security constraints by exploiting the LLM's ability to deviate from previous context, leading the model to generate harmful outputs. Specifically, the paper finds that existing LLM security training methods may only mask vulnerabilities rather than eliminate them. The authors therefore propose a method called "Diversified Attack Group Refinement" (DAGR), which diversifies and obfuscates attack prompts to raise the attack success rate while reducing the number of queries required.

### Main Contributions

1. **Revealing the Flaws in Existing LLM Security Training**: The paper demonstrates through experiments that existing security training methods may only temporarily mask vulnerabilities without fundamentally solving the problem.
2. **Proposing the DAGR Method**: DAGR significantly improves the attack success rate by diversifying and obfuscating attack prompts, and it performs well on multiple popular LLMs.
3. **Efficient Attack Strategy**: DAGR not only far exceeds existing methods in attack success rate but is also more efficient in both number of queries and runtime.
4. **Wide Applicability**: DAGR performs well on both black-box and white-box models and is applicable to various types of LLMs.

### Experimental Results

- **Attack Success Rate**: DAGR achieved a high attack success rate on multiple LLMs, especially on state-of-the-art models such as GPT-4 and GPT-4o, where its success rate is over 57% higher than that of existing methods.
- **Number of Queries**: DAGR requires significantly fewer queries, with a reduction of up to 92%.
- **Runtime**: DAGR also completes tasks faster, by factors of 398.2, 20.8, and 4.6 relative to the existing methods it is compared against.
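The efficiency figures on this page use two conventions: the abstract gives the share of queries used ("only 13% of the queries"), while the results above give a percentage reduction and speedup factors. The short sketch below only converts between these conventions; the function names are hypothetical, and the figures quoted here may come from different experimental settings, so the conversions are illustrative rather than a reconciliation of the reported numbers.

```python
def reduction_from_fraction(fraction_used: float) -> float:
    """Percentage reduction implied by using `fraction_used` of the baseline queries."""
    if not 0.0 <= fraction_used <= 1.0:
        raise ValueError("fraction_used must be between 0 and 1")
    return (1.0 - fraction_used) * 100.0


def fraction_from_reduction(reduction_pct: float) -> float:
    """Share of the baseline queries still used after a `reduction_pct` percent reduction."""
    if not 0.0 <= reduction_pct <= 100.0:
        raise ValueError("reduction_pct must be between 0 and 100")
    return 1.0 - reduction_pct / 100.0


def runtime_fraction(speedup: float) -> float:
    """Share of the baseline runtime needed when a method is `speedup` times faster."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    # Figures quoted on this page, converted between the two conventions.
    print(f"13% of queries -> {reduction_from_fraction(0.13):.0f}% reduction")
    print(f"92% reduction  -> {fraction_from_reduction(92.0):.0%} of queries")
    for factor in (398.2, 20.8, 4.6):
        print(f"{factor}x faster -> {runtime_fraction(factor):.2%} of baseline runtime")
```

Stating both conventions side by side makes it easier to compare the abstract's numbers with those in the summary.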
### Method Overview

1. **Problem Definition and Evaluation Criteria**: The paper defines two key scoring functions: a binary scoring function Sj for evaluating whether a jailbreak attempt is successful and a binary scoring function So for evaluating whether a prompt is relevant to the target.
2. **Scoring Function Design**: The paper improves existing scoring functions to more accurately identify jailbreak attempts embedded in harmless narratives.
3. **Attack Strategy Optimization**: The DAGR method systematically searches for potential jailbreak paths by diversifying and obfuscating attack prompts. At each depth level, the attacker model generates a diversified root prompt and multiple obfuscated leaf prompts to ensure thorough local search.
4. **Implementation Details**: The paper details the design of diversified and obfuscated system prompts and how to use memory and chain-of-thought to generate effective attack prompts.

### Conclusion

By proposing the DAGR method, this paper reveals the shortcomings of existing LLM security training methods and provides an efficient, diversified attack strategy. These findings highlight the need to reassess and improve current LLM security testing methods to ensure the safety and reliability of LLMs.
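To make the evaluation criteria from the Method Overview concrete, the sketch below shows one way the two binary scores Sj and So could be aggregated into an attack success rate when evaluating a model's robustness. It is a minimal illustration and not the authors' evaluation code: the `Judgement` class and `attack_success_rate` function are hypothetical, and the rule that a goal counts as compromised only when a single attempt scores 1 on both Sj and So is an assumption made here.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Judgement:
    """Judge output for one attempt (hypothetical container)."""
    jailbroken: bool  # Sj: did the response violate the safety policy?
    on_topic: bool    # So: was the prompt still relevant to the target goal?


def attack_success_rate(results: Mapping[str, Sequence[Judgement]]) -> float:
    """Fraction of goals with at least one attempt scoring Sj = 1 and So = 1.

    Assumption: both scores must hold on the same attempt for it to count.
    Returns 0.0 when there are no goals.
    """
    if not results:
        return 0.0
    compromised = sum(
        1
        for attempts in results.values()
        if any(j.jailbroken and j.on_topic for j in attempts)
    )
    return compromised / len(results)


if __name__ == "__main__":
    # Toy judge outputs for four goals; labels are illustrative only.
    results = {
        "goal-1": [Judgement(True, False), Judgement(True, True)],
        "goal-2": [Judgement(True, False)],   # jailbroken but off-topic
        "goal-3": [],                          # no attempts recorded
        "goal-4": [Judgement(False, True)],   # on-topic but refused
    }
    print(f"Attack success rate: {attack_success_rate(results):.2f}")
```

Requiring both scores on the same attempt keeps an off-topic response from being counted as a successful jailbreak, which matches the stated purpose of So as a relevance check.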