GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing

2024-06-28

Abstract: Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, Llama-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
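The three components named in the abstract (seed selection, mutation, judgment) fit an AFL-style loop. The sketch below is a generic illustration under stated assumptions, not the authors' implementation: `fuzz_loop` and all of its parameter names are hypothetical, every component is an injected placeholder callable, and feeding successful templates back into the pool is assumed by analogy with how AFL retains interesting inputs.

```python
from typing import Callable, List


def fuzz_loop(
    seeds: List[str],
    select: Callable[[List[str]], str],   # seed selection strategy (placeholder)
    mutate: Callable[[str], str],         # mutation operator (placeholder)
    query_target: Callable[[str], str],   # black-box call to the model under test
    judge: Callable[[str], bool],         # judgment model: did the attempt succeed?
    budget: int,
) -> List[str]:
    """Run `budget` select -> mutate -> query -> judge iterations."""
    pool = list(seeds)            # copy so the caller's seed list is not modified
    successes: List[str] = []
    for _ in range(budget):
        parent = select(pool)
        child = mutate(parent)
        response = query_target(child)
        if judge(response):
            successes.append(child)
            # Assumption: successful templates are kept as new seeds,
            # mirroring AFL's retention of interesting inputs.
            pool.append(child)
    return successes


if __name__ == "__main__":
    # Toy stand-ins so the loop can run without any model access.
    found = fuzz_loop(
        seeds=["a"],
        select=lambda pool: pool[-1],
        mutate=lambda s: s + "x",
        query_target=lambda prompt: prompt,
        judge=lambda response: len(response) >= 2,
        budget=4,
    )
    print(found)
```

Passing the components in as callables keeps the loop independent of any particular model, mutation scheme, or judge, which is what makes a black-box setup like the one described possible.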

Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the safety and robustness of large language models (LLMs). Although LLMs perform well across many applications, they can also generate harmful or illegal content. This risk is mainly realized through so-called "jailbreak attacks": carefully designed prompts that bypass an LLM's safety mechanisms and make it produce inappropriate content. Most current jailbreak attacks rely on manually written prompts, which are time-consuming and labor-intensive to create, hard to test at scale, and unlikely to cover all possible vulnerabilities. To solve these problems, the paper proposes an automated black-box jailbreak fuzz-testing framework named GPTFUZZER. The main objectives of GPTFUZZER are:

1. **Automatically generate jailbreak prompts**: Generating and mutating jailbreak prompts automatically improves the efficiency and coverage of testing, allowing a more comprehensive evaluation of LLM safety.
2. **Improve the scalability and adaptability of testing**: Manual methods struggle to keep up with continuously updated and evolving LLMs, while GPTFUZZER can quickly adapt to new model versions and updates.
3. **Provide an efficient seed selection strategy**: Multiple seed selection strategies (random selection, round-robin, the UCB algorithm, and MCTS-Explore) keep the testing process efficient and diverse.
4. **Evaluate and verify**: Extensive tests on multiple commercial and open-source LLMs verify the effectiveness and generality of GPTFUZZER.

Through these methods, GPTFUZZER aims to help researchers and practitioners evaluate and improve LLM safety more effectively and reduce potential harm.
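Of the seed selection strategies listed above, UCB is the easiest to illustrate in isolation. The following is a minimal sketch of the textbook UCB1 rule applied to seed indices; the class name `UCBSeedSelector`, the exploration constant `c`, and the reward convention (1.0 for a successful mutation, 0.0 otherwise) are assumptions made for this example, and the exact scoring formula used in the paper may differ.

```python
import math
from typing import List


class UCBSeedSelector:
    """Pick seed indices with the standard UCB1 rule (illustrative only)."""

    def __init__(self, n_seeds: int, c: float = 1.0) -> None:
        self.c = c                                   # exploration weight (assumed)
        self.pulls: List[int] = [0] * n_seeds        # times each seed was chosen
        self.reward: List[float] = [0.0] * n_seeds   # summed reward per seed
        self.total = 0                               # total selections so far

    def select(self) -> int:
        # Try every seed once before comparing scores.
        for i, n in enumerate(self.pulls):
            if n == 0:
                return i

        def score(i: int) -> float:
            mean = self.reward[i] / self.pulls[i]
            bonus = self.c * math.sqrt(2 * math.log(self.total) / self.pulls[i])
            return mean + bonus

        return max(range(len(self.pulls)), key=score)

    def update(self, i: int, reward: float) -> None:
        # Record the outcome of using seed i for one mutation attempt.
        self.pulls[i] += 1
        self.reward[i] += reward
        self.total += 1


if __name__ == "__main__":
    selector = UCBSeedSelector(n_seeds=3)
    for outcome in (0.0, 1.0, 0.0):
        idx = selector.select()
        selector.update(idx, outcome)
    print(selector.select())
```

The bonus term shrinks as a seed is chosen more often, so rarely used seeds keep getting occasional attempts while seeds with a high average reward are favored. That is the efficiency-versus-diversity balance the summary attributes to the seed selection component.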
