Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi
2024-10-31
Abstract: While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an attacker LLM to iteratively refine candidate (attack) prompts until one of the refined prompts jailbreaks the target. In addition, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks, reducing the number of queries sent to the target LLM. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts. This significantly improves upon the previous state-of-the-art black-box methods for generating jailbreaks while using a smaller number of queries than them. Furthermore, TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
Machine Learning, Artificial Intelligence, Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses a safety problem of large language models (LLMs): they can still be made to generate harmful, biased, and toxic content. Specifically, it studies how to automatically generate prompts that "jailbreak" black-box LLMs. "Jailbreaking" refers to using specific prompts to bypass the built-in safety filters of an LLM, so that it produces content it is meant to refuse (e.g., instructions on how to make a bomb).

### Background and Motivation

Although large language models are versatile in natural language processing and generation, they still generate harmful, biased, and toxic content, as shown by the prevalence of human-designed jailbreaks that bypass their safety mechanisms. Existing jailbreak methods usually require significant human intervention or apply only to open-source models, and the prompts they generate often lack natural meaning, making them easy to detect. The paper therefore proposes an automated method that needs only black-box access and generates interpretable prompts, in order to reveal potential vulnerabilities in LLM alignment methods.

### Solution

The paper proposes a method called "Tree of Attacks with Pruning" (TAP), which has the following features:

1. **Automation**: No human supervision is required.
2. **Black-box access**: Only query access to the LLM is needed, without knowledge of its internal parameters.
3. **Interpretability**: The generated prompts have natural meaning and are difficult to detect.

The workflow of TAP is as follows:

1. **Branching**: The attacker LLM generates multiple variants of each current prompt, starting from the initial prompt.
2. **Pruning (First Stage)**: The evaluator LLM assesses these variants and removes those unlikely to trigger a jailbreak, before anything is sent to the target.
3. **Attack and Evaluation**: The remaining variants are sent to the target LLM, and its responses are evaluated to see whether a jailbreak succeeded. If a successful jailbreak prompt is found, it is returned.
4. **Pruning (Second Stage)**: If no successful jailbreak prompt is found, only the highest-scoring prompts are retained for the next iteration.

### Experimental Results

The paper evaluates TAP on two datasets: an existing AdvBench Subset and a newly generated dataset. The results show that TAP achieves significantly higher success rates on most LLMs while using fewer queries. For example, on GPT4o, TAP increased the success rate from 78% to 94% and reduced the number of queries by 60%.

### Significance

TAP reveals potential vulnerabilities in existing alignment methods and provides an automated way to generate jailbreak prompts, which matters for improving the safety and alignment of LLMs. Its efficiency and interpretability make it more threatening in practical applications, but also give researchers a valuable tool for testing and improving LLM safety.
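
The four-step TAP workflow described under "Solution" can be read as a breadth-limited tree search. The sketch below is a minimal illustration of that control flow only, not the authors' implementation: the function `tap_search`, its parameters (`width`, `max_depth`, `success_score`), and the four callables are hypothetical names, and the attacker, evaluator, and target LLMs are replaced by toy string functions so the loop can run without any model.

```python
from typing import Callable, List, Optional, Tuple


def tap_search(
    root: str,
    branch: Callable[[str], List[str]],      # stand-in for the attacker LLM
    is_promising: Callable[[str], bool],     # stand-in for the evaluator's first-stage check
    query_target: Callable[[str], str],      # stand-in for the black-box target
    score: Callable[[str, str], int],        # stand-in for the evaluator's response score
    success_score: int = 10,                 # assumed threshold, not taken from the paper
    width: int = 2,                          # candidates kept per iteration (assumed)
    max_depth: int = 3,                      # number of iterations (assumed)
) -> Tuple[Optional[str], int]:
    """Return (successful candidate or None, number of target queries used)."""
    frontier = [root]
    queries = 0
    for _ in range(max_depth):
        # 1. Branching: expand every retained candidate into variants.
        candidates = [c for parent in frontier for c in branch(parent)]
        # 2. Pruning (first stage): drop unpromising variants before
        #    spending any query on the target.
        candidates = [c for c in candidates if is_promising(c)]
        if not candidates:
            return None, queries
        # 3. Attack and evaluation: query the target and score each response.
        scored = []
        for c in candidates:
            response = query_target(c)
            queries += 1
            s = score(c, response)
            if s >= success_score:
                return c, queries
            scored.append((s, c))
        # 4. Pruning (second stage): keep only the highest-scoring candidates.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [c for _, c in scored[:width]]
    return None, queries


# Toy stand-ins: candidates are strings over {"a", "b"}; "success" means
# the response contains three "A" characters.
def toy_branch(parent: str) -> List[str]:
    return [parent + "a", parent + "b"]


def toy_is_promising(candidate: str) -> bool:
    return not candidate.endswith("bb")


def toy_target(candidate: str) -> str:
    return candidate.upper()


def toy_score(candidate: str, response: str) -> int:
    hits = response.count("A")
    return 10 if hits >= 3 else hits


found, used = tap_search("", toy_branch, toy_is_promising, toy_target, toy_score)
print(found, used)
```

In this toy run the first-stage filter removes candidates ending in "bb" before they cost a target query, which is the mechanism the summary credits for TAP's lower query count; the `queries` counter makes that saving directly observable.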