DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh

2024-02-26

Abstract: The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt **D**ecomposition and **R**econstruction framework for jailbreak **Attack** (DrAttack). DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by in-context learning with a semantically similar but harmless reassembling demo, and (c) a 'Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain of success rate over prior SOTA prompt-only attackers. Notably, the success rate of 78.0% on GPT-4 with merely 15 queries surpassed previous art by 33.1%.

Artificial Intelligence, Computation and Language, Cryptography and Security

What problem does this paper attempt to address?

The paper studies a weakness in the safety alignment of large language models (LLMs): how an attacker can bypass an LLM's safety mechanisms and trigger it to generate harmful content. Existing attack methods embed the entire malicious prompt as a whole, which makes them prone to detection and rejection. The study therefore proposes a new attack framework, DrAttack, which hides malicious intent and improves attack success rates by decomposing the malicious prompt into multiple sub-prompts and implicitly recombining them using In-Context Learning (ICL).

DrAttack consists of three steps:

1. **Decomposition**: Decomposing the malicious prompt into seemingly neutral sub-prompts through semantic parsing.
2. **Implicit Reconstruction**: Implicitly recombining these sub-prompts through ICL, using a harmless and semantically similar example as context.
3. **Synonym Search**: Searching for synonyms of the sub-prompts to further improve attack efficiency and success rates.

Experiments show that DrAttack significantly improves attack success rates on various open-source and closed-source LLMs. Notably, it achieved a 78.0% success rate on GPT-4 with only 15 queries, surpassing previous methods by 33.1%. The framework also bypasses several defense mechanisms, such as OpenAI's content moderation tools and perplexity filters.
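The perplexity filter named among the defenses rejects prompts whose text looks statistically unnatural to a language model. The following is a minimal sketch of that idea, not the configuration evaluated in the paper. It assumes per-token natural-log probabilities have already been obtained from some scoring model; the function names, the sample values, and the threshold are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp(-mean(log p(token))) for one prompt.

    token_logprobs: natural-log probabilities of each token, as produced
    by a scoring language model (assumed to be computed elsewhere).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Accept the prompt only if its perplexity does not exceed the threshold."""
    return perplexity(token_logprobs) <= threshold


# Hypothetical values: fluent text has high per-token probability,
# while a string of unlikely tokens has very low per-token probability.
fluent_prompt = [-1.0, -1.2, -0.8]
unlikely_prompt = [-9.0, -10.0, -11.0]

print(passes_filter(fluent_prompt, threshold=50.0))    # True
print(passes_filter(unlikely_prompt, threshold=50.0))  # False
```

A filter of this kind is sensitive to unnatural token sequences, not to intent. Prompts assembled from fluent natural-language fragments would be expected to score low perplexity, which is one plausible reason the summary reports that such filters are bypassed.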

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Distract Large Language Models for Automatic Jailbreak Attack

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation

DROJ: A Prompt-Driven Attack against Large Language Models

LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper

RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains

Don't Say No: Jailbreaking LLM by Suppressing Refusal

Multi-round jailbreak attack on large language models

ObscurePrompt: Jailbreaking Large Language Models via Obscure Input

h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment

Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks

FlipAttack: Jailbreak LLMs via Flipping