"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang
2024-05-15
Abstract: The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses the abuse of "jailbreak prompts" against large language models (LLMs). Using a new framework named JailbreakHub, the authors analyze 1,405 jailbreak prompts collected between December 2022 and December 2023. They identify 131 jailbreak communities and characterize the prompts and their main attack strategies, such as prompt injection and privilege escalation. They also observe that jailbreak prompts are shifting from online Web communities to prompt-aggregation websites, and that 28 user accounts have kept optimizing jailbreak prompts for more than 100 days.

To evaluate the potential harm, the authors build a question set of 107,250 samples covering 13 forbidden scenarios and run experiments on six popular LLMs. The results show that the safeguards of these LLMs cannot adequately defend against jailbreak prompts in all scenarios. In particular, five highly effective jailbreak prompts achieve an attack success rate of 0.95 on ChatGPT (GPT-3.5) and GPT-4, and the earliest of them has been online for more than 240 days.

The main findings of the study include:
- Jailbreak prompts are becoming an increasingly popular, crowdsourced form of attack against LLMs. In the collected data, 803 user accounts take part in creating and sharing jailbreak prompts, and 28 of them have curated an average of nine jailbreak prompts each over more than 100 days.
- To bypass safeguards, jailbreak prompts usually combine several techniques. One visible sign is length: jailbreak prompts average 555 tokens, about 1.5 times as long as regular prompts.
- LLMs trained with Reinforcement Learning from Human Feedback (RLHF) show some resistance to forbidden questions asked directly, but they are much weaker against jailbreak prompts. Some jailbreak prompts reach an attack success rate of 0.95 on these LLMs.
- External safeguards reduce the effectiveness of jailbreak prompts only to a limited extent, which indicates the need for stronger and more adaptive defense mechanisms.

In conclusion, the paper systematically studies existing jailbreak prompts in order to promote the development of safer and better-regulated LLMs.
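The attack success rate quoted in this summary can be read as a simple proportion. The sketch below is only an illustration under an assumption not stated here: that each model response has already been labeled as either answering the forbidden question or not. The function name `attack_success_rate` and the sample labels are hypothetical and are not taken from the paper's code.

```python
from typing import Iterable


def attack_success_rate(judgements: Iterable[bool]) -> float:
    """Return the fraction of responses labeled as successful jailbreaks.

    Assumption: True means the model answered the forbidden question,
    False means it refused or deflected.
    """
    results = list(judgements)
    if not results:
        raise ValueError("no judgements supplied")
    return sum(results) / len(results)


# Hypothetical labels for 20 responses: 19 successes and 1 refusal.
labels = [True] * 19 + [False]
print(f"ASR = {attack_success_rate(labels):.2f}")
```

With these made-up labels the ratio is 19/20, which corresponds to the 0.95 level reported for the most effective prompts.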