A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

2024-05-17

Abstract: Large Language Models (LLMs) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of "jailbreaking", where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLaMA, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

Cryptography and Security, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the security of large language models (LLMs) against "jailbreak attacks." Specifically:

1. **Research Background and Objectives**: As LLMs are deployed more widely, the content they generate can have societal impact, including content that is harmful. Researchers have adopted various safety training techniques to align model outputs and prevent malicious content. Even so, jailbreak attacks, in which carefully designed prompts bypass a model's safety measures and elicit harmful output, remain a significant challenge.
2. **Research Methods**: The paper systematically analyzes existing jailbreak attack and defense techniques and evaluates nine attack techniques and seven defense techniques on three language models (Vicuna, LLaMA, GPT-3.5 Turbo). The goal is to assess how effective these attacks and defenses are.
3. **Main Findings**:
   - Existing white-box attacks underperform compared with universal techniques.
   - The presence of special tokens (such as ‘[/INST]’) in the input significantly affects the success rate of attacks.
   - Existing defense strategies differ widely in effectiveness. The Bergeron method is identified as one of the most effective, while other methods either fail to prevent jailbreak attacks or are too strict, producing false positives on normal prompts.
4. **Contributions**: The paper provides the first systematic study to comprehensively evaluate the effectiveness of jailbreak attack and defense techniques, and it publicly releases a benchmark dataset and testing framework to promote further research on LLM security.
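The findings above rest on two kinds of measurement: how often an attack still succeeds when a defense is in place, and how often a defense wrongly blocks normal prompts. The sketch below shows one way such rates could be tallied from labeled trial records. It is a minimal illustration under stated assumptions, not the paper's released framework: the `Trial` record, its field names, and both function names are hypothetical, and this summary does not specify the paper's own metrics or labeling procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One evaluated prompt. All field names are hypothetical."""
    is_malicious: bool    # True for a jailbreak attempt, False for a benign prompt
    blocked: bool         # True if the defense refused or filtered the prompt
    harmful_output: bool  # True if the model's response was judged harmful


def attack_success_rate(trials):
    """Fraction of jailbreak attempts that got past the defense and produced harmful output."""
    attacks = [t for t in trials if t.is_malicious]
    if not attacks:
        return 0.0
    successes = sum(t.harmful_output and not t.blocked for t in attacks)
    return successes / len(attacks)


def false_positive_rate(trials):
    """Fraction of benign prompts that the defense blocked anyway."""
    benign = [t for t in trials if not t.is_malicious]
    if not benign:
        return 0.0
    return sum(t.blocked for t in benign) / len(benign)


# Made-up records for illustration only; these are not results from the paper.
trials = [
    Trial(is_malicious=True, blocked=True, harmful_output=False),
    Trial(is_malicious=True, blocked=False, harmful_output=True),
    Trial(is_malicious=True, blocked=False, harmful_output=False),
    Trial(is_malicious=True, blocked=True, harmful_output=False),
    Trial(is_malicious=False, blocked=False, harmful_output=False),
    Trial(is_malicious=False, blocked=True, harmful_output=False),
]

asr = attack_success_rate(trials)
fpr = false_positive_rate(trials)
print(f"attack success rate: {asr:.2f}, false positive rate: {fpr:.2f}")
```

Reporting both numbers together reflects the trade-off the summary describes: a defense that blocks everything would drive the attack success rate to zero while making the false positive rate unacceptably high.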

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models

Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective

Distract Large Language Models for Automatic Jailbreak Attack

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Comprehensive Assessment of Jailbreak Attacks Against LLMs

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models

Playing Language Game with LLMs Leads to Jailbreaking

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

Improved Large Language Model Jailbreak Detection via Pretrained Embeddings