Abstract: As Large Language Models (LLMs) are increasingly being deployed in safety-critical applications, their vulnerability to potential jailbreaks -- malicious prompts that can disable the safety mechanism of LLMs -- has attracted growing research attention. While alignment methods have been proposed to protect LLMs from jailbreaks, many have found that aligned LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. Existing jailbreak attacks on LLMs can be categorized into prompt-level methods which make up stories/logic to circumvent safety alignment and token-level attack methods which leverage gradient methods to find adversarial tokens. In this work, we introduce the concept of Ensemble Jailbreak and explore methods that can integrate prompt-level and token-level jailbreak into a more powerful hybrid jailbreak attack. Specifically, we propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector. We evaluate the effectiveness of EnJa on several aligned models and show that it achieves a state-of-the-art attack success rate with fewer queries and is much stronger than any individual jailbreak.

What problem does this paper attempt to address?

The paper addresses the vulnerability of large language models (LLMs) deployed in safety-critical applications to so-called "jailbreak attacks": carefully designed malicious prompts that make an LLM bypass its built-in safety mechanisms and generate content that violates policy regulations. Although alignment methods have been proposed to protect LLMs from jailbreak attacks, these methods still have loopholes, so aligned LLMs can be jailbroken as well.

Specifically, the paper divides existing jailbreak attacks into two categories:

1. **Prompt-level attacks**: Evade safety alignment by constructing stories or logic.
2. **Token-level attacks**: Use gradient methods to find adversarial tokens.

Building on this categorization, the authors introduce a new attack paradigm, **Ensemble Jailbreak (EnJa)**, which combines the advantages of prompt-level and token-level attacks into a more powerful hybrid attack. The main goal of EnJa is to improve the effectiveness and efficiency of the attack while reducing the number of required queries.

### Main contributions

1. **Propose the EnJa framework**: Combine black-box attacks based on prompt optimization with white-box, gradient-based adversarial attacks to create more effective and efficient jailbreak attacks.
2. **Design an ensemble connector**: Join the prompt-level and token-level attacks through a designed template to enhance attack strength and keep the combined prompt coherent.
3. **Improve attack strategies**: Add off-topic checking to ensure that prompts do not deviate from the topic, and introduce a regret prevention loss to prevent the model from self-correcting.

### Experimental results

The paper verifies the effectiveness of EnJa through experiments, reporting that its attack success rate on multiple open-source and commercial LLMs is significantly higher than that of existing methods, with a significant improvement in attack efficiency as well.

In summary, the paper studies the security of LLMs against jailbreak attacks, especially in safety-critical applications, and proposes a new attack paradigm, EnJa, that improves both the success rate and the efficiency of the attack.
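The two quantities the results are stated in, attack success rate and number of queries, can be made concrete with a small evaluation sketch. This is only an illustration of how such metrics are commonly computed, not the paper's evaluation code: the `AttemptRecord` layout, the function names, and the sample values are all hypothetical, and how an attempt is judged successful is left outside the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AttemptRecord:
    """One evaluated behavior (hypothetical layout, not taken from the paper)."""
    behavior_id: str
    succeeded: bool    # outcome as decided by some external judge
    queries_used: int  # queries sent to the target model for this behavior


def attack_success_rate(records: Iterable[AttemptRecord]) -> float:
    """Fraction of evaluated behaviors for which the attempt succeeded."""
    records = list(records)
    if not records:
        raise ValueError("no records to evaluate")
    return sum(r.succeeded for r in records) / len(records)


def mean_queries_on_success(records: Iterable[AttemptRecord]) -> Optional[float]:
    """Average query count over successful attempts, or None if none succeeded."""
    used = [r.queries_used for r in records if r.succeeded]
    return sum(used) / len(used) if used else None


# Made-up sample data, purely to show the arithmetic.
sample = [
    AttemptRecord("b1", True, 10),
    AttemptRecord("b2", False, 50),
    AttemptRecord("b3", True, 20),
    AttemptRecord("b4", True, 30),
]
asr = attack_success_rate(sample)
avg_queries = mean_queries_on_success(sample)
```

Reporting the query average only over successful attempts is one possible convention; averaging over all attempts is equally defensible, and the summary above does not say which one the paper uses.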

EnJa: Ensemble Jailbreak on Large Language Models

Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Jailbreaking Black Box Large Language Models in Twenty Queries

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

Distract Large Language Models for Automatic Jailbreak Attack

Don't Say No: Jailbreaking LLM by Suppressing Refusal

Comprehensive Assessment of Jailbreak Attacks Against LLMs

Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

Multi-round jailbreak attack on large language models

Weak-to-Strong Jailbreaking on Large Language Models

Improved Large Language Model Jailbreak Detection via Pretrained Embeddings