h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment

Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, Christopher D. Manning

2024-09-13

Abstract: The safety of Large Language Models (LLMs) remains a critical concern due to a lack of adequate benchmarks for systematically evaluating their ability to resist generating harmful content. Previous efforts towards automated red teaming involve static or templated sets of illicit requests and adversarial prompts which have limited utility given jailbreak attacks' evolving and composable nature. We propose a novel dynamic benchmark of composable jailbreak attacks to move beyond static datasets and taxonomies of attacks and harms. Our approach consists of three components collectively called h4rm3l: (1) a domain-specific language that formally expresses jailbreak attacks as compositions of parameterized prompt transformation primitives, (2) bandit-based few-shot program synthesis algorithms that generate novel attacks optimized to penetrate the safety filters of a target black box LLM, and (3) open-source automated red-teaming software employing the previous two components. We use h4rm3l to generate a dataset of 2656 successful novel jailbreak attacks targeting 6 state-of-the-art (SOTA) open-source and proprietary LLMs. Several of our synthesized attacks are more effective than previously reported ones, with Attack Success Rates exceeding 90% on SOTA closed language models such as claude-3-haiku and GPT4-o. By generating datasets of jailbreak attacks in a unified formal representation, h4rm3l enables reproducible benchmarking and automated red-teaming, contributes to understanding LLM safety limitations, and supports the development of robust defenses in an increasingly LLM-integrated world. Warning: This paper and related research artifacts contain offensive and potentially disturbing prompts and model-generated content.
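
To make the idea in component (1) concrete, the following is a minimal sketch of what "compositions of parameterized prompt transformation primitives" could look like in code. It is not the h4rm3l language or its API: the `Transform` class, the `then` method, and the three primitives (`reverse_words`, `base64_encode`, `affix`) are hypothetical names chosen for illustration, and the primitives are deliberately generic string operations applied to a benign input.

```python
import base64
from dataclasses import dataclass
from typing import Callable


# Hypothetical illustration only; none of these names come from h4rm3l.
@dataclass(frozen=True)
class Transform:
    """A named string-to-string prompt transformation."""
    name: str
    fn: Callable[[str], str]

    def __call__(self, prompt: str) -> str:
        return self.fn(prompt)

    def then(self, other: "Transform") -> "Transform":
        """Compose two transformations: apply self first, then other."""
        return Transform(
            name=f"{self.name}.then({other.name})",
            fn=lambda p: other(self(p)),
        )


def base64_encode() -> Transform:
    """Primitive with no parameters: Base64-encode the prompt."""
    return Transform(
        "Base64()",
        lambda p: base64.b64encode(p.encode("utf-8")).decode("ascii"),
    )


def affix(prefix: str = "", suffix: str = "") -> Transform:
    """Parameterized primitive: wrap the prompt in a prefix and suffix."""
    return Transform(
        f"Affix({prefix!r}, {suffix!r})",
        lambda p: f"{prefix}{p}{suffix}",
    )


def reverse_words() -> Transform:
    """Primitive with no parameters: reverse the word order."""
    return Transform("ReverseWords()", lambda p: " ".join(reversed(p.split())))


# A "program" is just a chain of primitives with a printable representation.
program = reverse_words().then(base64_encode()).then(affix(prefix="[decode] "))
result = program("hello world")
print(program.name)
print(result)
```

The point of the sketch is that each composed program has both an executable form and a textual form (`program.name`), which is the property that lets a unified representation be stored, compared, and generated by a synthesizer.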

Cryptography and Security, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the problem of evaluating how well large language models (LLMs) resist jailbreak attacks. Existing evaluation methods rely largely on static or templated sets of illicit requests and adversarial prompts, which are of limited utility because jailbreak attacks keep evolving and can be composed with one another. The authors therefore propose h4rm3l, a dynamic benchmark designed to systematically evaluate the ability of LLMs to resist generating harmful content. h4rm3l consists of three components:

1. **Domain-Specific Language (DSL)**: Formally expresses jailbreak attacks as compositions of parameterized prompt transformation primitives.
2. **Bandit-based Few-shot Program Synthesis Algorithms**: Generate novel attacks optimized to penetrate the safety filters of a target black-box LLM.
3. **Open-source Automated Red-Teaming Software**: Employs the two components above for automated red teaming.

Using h4rm3l, the authors generate a dataset of 2656 successful novel jailbreak attacks targeting six state-of-the-art open-source and proprietary LLMs. Several of the synthesized attacks reach attack success rates above 90% on closed models such as claude-3-haiku and GPT4-o. Because the attacks share a unified formal representation, the benchmark supports reproducible evaluation, helps reveal the safety limitations of LLMs, and supports the development of more robust defenses.
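
The summary above does not specify which bandit algorithm the synthesis step uses, so the following sketch swaps in the standard UCB1 rule purely to illustrate the general idea of a bandit-guided search: repeatedly pick one of several options, observe a score, and shift effort toward options that score well. Everything here is an assumption for illustration: the arms stand in for hypothetical pools of few-shot examples, and the reward is simulated with made-up probabilities rather than obtained by querying any model.

```python
import math
import random
from typing import Callable, List


def ucb1_select(counts: List[int], totals: List[float], t: int) -> int:
    """Pick the arm with the highest UCB1 index; untried arms are tried first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: totals[a] / counts[a]
        + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def run_bandit(
    n_arms: int, reward_fn: Callable[[int], float], rounds: int
) -> List[int]:
    """Run UCB1 for `rounds` steps and return how often each arm was chosen."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        counts[arm] += 1
        totals[arm] += reward_fn(arm)  # reward assumed to lie in [0, 1]
    return counts


# Simulated stand-in for "synthesize a candidate from this pool and score it".
rng = random.Random(0)
true_rates = [0.1, 0.5, 0.9]  # made-up success probabilities, one per arm
pulls = run_bandit(
    len(true_rates),
    lambda a: float(rng.random() < true_rates[a]),
    rounds=2000,
)
print(pulls)
```

In this simulation the arm with the highest underlying rate ends up chosen far more often than the others, while the exploration term still guarantees that every arm is sampled occasionally. A real synthesis loop would replace the simulated reward with a score computed from the target model's responses.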
