Abstract: While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that requires only black-box access to the target LLM. TAP uses an attacker LLM to iteratively refine candidate (attack) prompts until one of the refined prompts jailbreaks the target. In addition, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks, reducing the number of queries sent to the target LLM. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts. This significantly improves upon the previous state-of-the-art black-box methods for generating jailbreaks while using fewer queries. Furthermore, TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the safety problem of large language models (LLMs) generating harmful, biased, and toxic content. Specifically, it focuses on automatically generating prompts that "jailbreak" black-box LLMs. "Jailbreaking" refers to using specially crafted prompts to bypass an LLM's built-in safety filters, so that the model produces the undesired information the user requested (e.g., instructions on how to make a bomb).

### Background and Motivation

Although LLMs are versatile in natural language processing and generation, they still generate harmful, biased, and toxic content, as shown by the prevalence of human-designed jailbreaks that bypass their safety mechanisms. Existing jailbreak methods usually require significant human intervention or apply only to open-source models, and the prompts they generate often lack natural meaning, which makes them easy to detect. The paper therefore proposes an automated method that needs only black-box access and generates interpretable prompts, revealing potential vulnerabilities in LLM alignment methods.

### Solution

The paper proposes a method called "Tree of Attacks with Pruning" (TAP), which has the following features:

1. **Automation**: No human supervision is required.
2. **Black-box access**: Only query access to the LLM is needed, without knowledge of its internal parameters.
3. **Interpretability**: The generated prompts have natural meaning and are difficult to detect.

The workflow of TAP is as follows:

1. **Branching**: The attacker LLM generates multiple variants of the initial prompt.
2. **Pruning (First Stage)**: The evaluator LLM assesses these variants and removes those unlikely to trigger a jailbreak.
3. **Attack and Evaluation**: The remaining variants are sent to the target LLM, and its responses are evaluated to determine whether a jailbreak succeeded. If a successful jailbreak prompt is found, it is returned.
4. **Pruning (Second Stage)**: If no successful jailbreak prompt is found, the highest-scoring prompts are retained for the next iteration.

### Experimental Results

The paper evaluates TAP on two datasets: an existing AdvBench Subset and a newly generated dataset. The results show that TAP achieves significantly higher success rates on most LLMs while using fewer queries. For example, on GPT4o, TAP increased the success rate from 78% to 94% and reduced the number of queries by 60%.

### Significance

TAP reveals potential vulnerabilities in existing alignment methods and provides an automated way to generate jailbreak prompts, which is significant for improving the safety and alignment of LLMs. Its efficiency and interpretability make it more threatening in practical applications, but they also give researchers a valuable tool for testing and improving LLM safety.
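The four-step workflow described above (branch, prune, query and evaluate, prune again) amounts to a width-limited tree search. The following is a minimal sketch of that control flow only, not the authors' implementation. The names `tap_search`, `branch`, `on_topic`, `query_target`, `score`, `width`, `max_depth`, and `success_threshold` are all assumptions introduced for illustration, and the first pruning stage is modelled as a simple yes/no filter. The toy stand-ins at the bottom operate on plain strings and call no LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    prompt: str
    score: float = 0.0  # evaluator's score for the target's response


def tap_search(
    root_prompt: str,
    branch: Callable[[str], List[str]],   # hypothetical attacker: prompt -> variants
    on_topic: Callable[[str], bool],      # hypothetical evaluator, first pruning stage
    query_target: Callable[[str], str],   # hypothetical black-box target
    score: Callable[[str, str], float],   # hypothetical evaluator: (prompt, response) -> score
    success_threshold: float = 1.0,       # assumed success criterion
    width: int = 2,                       # assumed number of prompts kept per iteration
    max_depth: int = 5,                   # assumed iteration limit
) -> Optional[str]:
    frontier = [Node(root_prompt)]
    for _ in range(max_depth):
        # 1. Branching: expand every retained prompt into several variants.
        candidates = [v for node in frontier for v in branch(node.prompt)]
        # 2. Pruning (first stage): drop variants before they cost a target query.
        candidates = [p for p in candidates if on_topic(p)]
        if not candidates:
            return None
        # 3. Attack and evaluation: query the target and score each response.
        scored = []
        for p in candidates:
            s = score(p, query_target(p))
            if s >= success_threshold:
                return p
            scored.append(Node(p, s))
        # 4. Pruning (second stage): keep only the highest-scoring prompts.
        scored.sort(key=lambda n: n.score, reverse=True)
        frontier = scored[:width]
    return None


# Toy stand-ins (assumptions): "success" means the response contains three "A"s.
queries: List[str] = []


def toy_branch(p: str) -> List[str]:
    return [p + "a", p + "b", p + "!"]


def toy_on_topic(p: str) -> bool:
    return "!" not in p


def toy_target(p: str) -> str:
    queries.append(p)  # record every query that reaches the target
    return p.upper()


def toy_score(p: str, response: str) -> float:
    return min(response.count("A") / 3, 1.0)


result = tap_search("", toy_branch, toy_on_topic, toy_target, toy_score)
print(result, len(queries))
```

In this toy run, variants containing `!` are discarded in the first pruning stage and never reach `toy_target`, which is the mechanism the summary credits for the lower query count. The second pruning stage bounds the frontier at `width` prompts, so the number of target queries per iteration stays fixed as the tree deepens.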

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
