Abstract: Recently, advanced Large Language Models (LLMs) such as GPT-4 have been integrated into many real-world applications like Code Copilot. These applications have significantly expanded the attack surface of LLMs, exposing them to a variety of threats. Among them, jailbreak attacks that induce toxic responses through jailbreak prompts have raised critical safety concerns. To identify these threats, a growing number of red teaming approaches simulate potential adversarial scenarios by crafting jailbreak prompts to test the target LLM. However, existing red teaming methods do not consider the unique vulnerabilities of LLMs in different scenarios, making it difficult to adjust the jailbreak prompts to find context-specific vulnerabilities. Meanwhile, these methods are limited to refining jailbreak templates using a few mutation operations, lacking the automation and scalability to adapt to different scenarios. To enable context-aware and efficient red teaming, we abstract and model existing attacks into a coherent concept called "jailbreak strategy" and propose a multi-agent LLM system named RedAgent that leverages these strategies to generate context-aware jailbreak prompts. By self-reflecting on contextual feedback in an additional memory buffer, RedAgent continuously learns how to leverage these strategies to achieve effective jailbreaks in specific contexts. Extensive experiments demonstrate that our system can jailbreak most black-box LLMs in just five queries, making it twice as efficient as existing red teaming methods. Additionally, RedAgent can jailbreak customized LLM applications more efficiently. By generating context-aware jailbreak prompts towards applications on GPTs, we discover 60 severe vulnerabilities of these real-world applications with only two queries per vulnerability. We have reported all found issues and communicated with OpenAI and Meta for bug fixes.

What problem does this paper attempt to address?

The paper addresses the security of large language models (LLMs) deployed in practical applications, specifically their exposure to "jailbreak attacks". These attacks use carefully designed prompts to bypass an LLM's built-in safety mechanisms and induce it to generate harmful or inappropriate content. The paper focuses on two main challenges:

1. **Lack of high-quality jailbreak prompts**: The jailbreak prompts generated by existing red-teaming methods often perform poorly in specific application scenarios. Specific LLM applications are usually fine-tuned or use additional data, which gives each of them different and unique weaknesses. Generating jailbreak prompts adapted to a specific "context" is therefore an important but non-trivial goal of red-teaming methods.
2. **Lack of automation and scalability**: Existing red-teaming methods are limited to a few mutation operations (such as synonym substitution and character splitting) for optimizing manually written jailbreak templates, which restricts how far prompt generation can be automated and scaled. Moreover, because LLM responses are long, existing methods can store only a small number of interaction records with the target model. This is inefficient when the context must be adjusted continuously to discover the unique weaknesses of the target LLM.

To address these challenges, the paper proposes a multi-agent system named RedAgent, which automatically generates context-adapted jailbreak prompts and continuously improves the effectiveness of its attacks through self-reflection and learning mechanisms. The main contributions of RedAgent include:

- **Proposing a new "context-aware" jailbreak prompt generation technique** that captures the context information of different LLM models and applications, and so tests the jailbreak vulnerabilities of LLMs more effectively.
- **Designing and implementing RedAgent**, an automated and efficient red-teaming method that autonomously generates context-aware jailbreak prompts combining multiple jailbreak strategies. Experimental results show that RedAgent can successfully jailbreak most black-box LLMs within only 5 queries, twice as efficiently as existing methods.
- **Evaluating 60 popular customized LLM applications** and discovering 60 real-world problems that may have serious security impact. Applications that integrate external data or tools are found to be especially vulnerable to jailbreak attacks.

In conclusion, this paper aims to improve the security testing capabilities of LLM applications through the RedAgent system, so that potential security vulnerabilities can be better identified and fixed.
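
The memory-buffer idea summarized above (keeping compact feedback rather than full LLM responses, so that more interaction history fits) can be illustrated with a small data structure. The following is only a minimal sketch under stated assumptions, not RedAgent's actual implementation: the `Reflection` schema, the `MemoryBuffer` class, the strategy labels, and the success-rate heuristic are all hypothetical names and choices introduced here for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Reflection:
    """Condensed record of one test interaction (hypothetical schema)."""
    strategy: str    # label of the strategy that was tried
    context: str     # short label for the application under test
    succeeded: bool  # whether the evaluator flagged the response
    note: str        # one-line lesson kept instead of the full response


class MemoryBuffer:
    """Fixed-capacity buffer; the oldest reflections are evicted first."""

    def __init__(self, capacity: int = 50) -> None:
        # deque(maxlen=...) silently drops the oldest item once full.
        self._items = deque(maxlen=capacity)

    def __len__(self) -> int:
        return len(self._items)

    def add(self, reflection: Reflection) -> None:
        self._items.append(reflection)

    def success_rates(self, context: str) -> Dict[str, float]:
        """Per-strategy success rate among reflections for one context."""
        tried: Dict[str, int] = {}
        won: Dict[str, int] = {}
        for r in self._items:
            if r.context != context:
                continue
            tried[r.strategy] = tried.get(r.strategy, 0) + 1
            won[r.strategy] = won.get(r.strategy, 0) + int(r.succeeded)
        return {s: won[s] / tried[s] for s in tried}

    def best_strategy(self, context: str) -> Optional[str]:
        """Strategy with the highest observed success rate, if any."""
        rates = self.success_rates(context)
        return max(rates, key=rates.get) if rates else None


# Usage with made-up records for a hypothetical "code_assistant" context.
buf = MemoryBuffer(capacity=3)
buf.add(Reflection("strategy_a", "code_assistant", False, "refused outright"))
buf.add(Reflection("strategy_b", "code_assistant", True, "worked when framed as a task"))
buf.add(Reflection("strategy_b", "code_assistant", False, "refused on retry"))
```

Storing a one-line `note` per interaction keeps each record small, which is the property the summary attributes to the memory buffer; how RedAgent actually condenses feedback and selects strategies is not specified in this summary.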

RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

Jailbreaking? One Step Is Enough!

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

Defending Jailbreak Prompts via In-Context Adversarial Game

Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization

BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue

h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment

Automated Progressive Red Teaming

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction