Abstract: The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

What problem does this paper attempt to address?

This paper addresses the misuse of "jailbreak prompts" against large language models (LLMs). Using a new framework named JailbreakHub, it conducts a comprehensive analysis of 1,405 jailbreak prompts spanning December 2022 to December 2023. The researchers identify 131 jailbreak communities and characterize the distinguishing features of jailbreak prompts and their main attack strategies, such as prompt injection and privilege escalation. They also observe that jailbreak prompts are shifting from online Web communities to prompt-aggregation websites, and that 28 user accounts have consistently optimized jailbreak prompts over more than 100 days.

To evaluate the potential harm caused by jailbreak prompts, the researchers create a question set of 107,250 samples covering 13 forbidden scenarios. Experiments with this dataset on six popular LLMs show that the models' safeguards cannot adequately defend against jailbreak prompts in all scenarios. In particular, five highly effective jailbreak prompts achieve an attack success rate of 0.95 on ChatGPT (GPT-3.5) and GPT-4, and the earliest of them has persisted online for more than 240 days.

The main findings of the study include:

- Jailbreak prompts are becoming a trending, crowdsourced form of attack against LLMs. In the collected data, 803 user accounts participate in creating and sharing jailbreak prompts, and 28 of these accounts each curate an average of 9 jailbreak prompts over more than 100 days.
- To bypass safeguards, jailbreak prompts usually combine several techniques. They are also significantly longer than regular prompts: 555 tokens on average, about 1.5 times the length of a regular prompt.
- LLMs trained with Reinforcement Learning from Human Feedback (RLHF) show some resistance to forbidden questions asked directly, but weaker resistance to jailbreak prompts. Some jailbreak prompts achieve an attack success rate of 0.95 on these LLMs.
- External safeguards reduce the effectiveness of jailbreak prompts only to a limited extent, indicating the need for stronger and more adaptive defense mechanisms.

In conclusion, the paper aims to systematically study and understand existing jailbreak prompts in order to promote the development of safer and better-regulated LLMs.
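The attack success rate quoted above can be read as the fraction of forbidden-question trials in which the model's response was judged to actually answer the question. The following is a minimal sketch of that bookkeeping, not the paper's evaluation code: the `Trial` record, its field names, and the sample data are hypothetical, and the step that judges whether a response counts as an answer is assumed to have been done elsewhere.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Trial(NamedTuple):
    # All fields are illustrative assumptions, not names from the paper.
    prompt_id: str   # identifier of the jailbreak prompt used in this trial
    scenario: str    # forbidden scenario the question belongs to
    answered: bool   # True if the response was judged to answer the question


def attack_success_rate(trials: Iterable[Trial]) -> dict:
    """Return, per prompt, the fraction of trials judged successful."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t in trials:
        total[t.prompt_id] += 1
        hits[t.prompt_id] += int(t.answered)
    return {pid: hits[pid] / total[pid] for pid in total}


# Made-up trials purely to exercise the function.
trials = [
    Trial("prompt_a", "scenario_1", True),
    Trial("prompt_a", "scenario_1", True),
    Trial("prompt_a", "scenario_2", False),
    Trial("prompt_a", "scenario_2", True),
    Trial("prompt_b", "scenario_1", False),
    Trial("prompt_b", "scenario_2", False),
]
asr = attack_success_rate(trials)
print(asr)
```

Under this reading, a reported rate of 0.95 would mean that 95 of every 100 trials using a given prompt were judged successful. Grouping by `scenario` instead of `prompt_id` would give a per-scenario breakdown in the same way.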

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models

Comprehensive Assessment of Jailbreak Attacks Against LLMs

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Multi-round jailbreak attack on large language models

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

ObscurePrompt: Jailbreaking Large Language Models via Obscure Input