A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek
2024-05-17
Abstract: Large Language Models (LLMs) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of "jailbreaking", where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the security of large language models (LLMs) against "jailbreak attacks." Specifically:

1. **Research Background and Objectives**: As LLMs are deployed more widely, the content they generate can have societal impact, and some of that content may be harmful. Researchers have adopted safety training techniques to align model outputs and curb malicious content. Despite this, jailbreak attacks, in which carefully crafted prompts bypass a model's safety measures and elicit harmful responses, remain a significant challenge.
2. **Research Methods**: The paper systematically analyzes existing jailbreak attack and defense techniques and evaluates nine attack techniques and seven defense techniques on three language models (Vicuna, LLama, and GPT-3.5 Turbo). The goal is to assess how effective these attacks and defenses are.
3. **Main Findings**:
   - Existing white-box attacks underperform compared to universal techniques.
   - The presence of special tokens (such as ‘[/INST]’) in the input significantly affects the success rate of attacks.
   - Existing defense strategies differ in effectiveness. The Bergeron method is identified as one of the most effective, while other methods either fail entirely to prevent jailbreak attacks or are too strict, producing false positives on normal prompts.
4. **Contributions**: The paper provides the first systematic study to comprehensively evaluate the effectiveness of jailbreak attack and defense techniques, and it publicly releases a benchmark dataset and testing framework to promote further research in LLM security.
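
The summary above turns on two quantities: how often an attack succeeds, and how often a defense wrongly blocks a normal prompt. The following is a minimal sketch of how these could be computed from labeled trial records. It is not the paper's testing framework, which this summary does not show. The `Trial` record, its field names, and both function names are hypothetical, and the sketch assumes each response has already been labeled harmful or not by some external judge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One prompt sent to a (possibly defended) model. Hypothetical record layout."""
    is_malicious: bool    # True for a jailbreak prompt, False for a normal prompt
    blocked: bool         # True if the defense refused or filtered the prompt
    harmful_output: bool  # True if a judge labeled the response as harmful


def attack_success_rate(trials: List[Trial]) -> float:
    """Fraction of malicious prompts that still produced a harmful response."""
    malicious = [t for t in trials if t.is_malicious]
    if not malicious:
        return 0.0
    return sum(t.harmful_output for t in malicious) / len(malicious)


def false_positive_rate(trials: List[Trial]) -> float:
    """Fraction of normal prompts that the defense blocked."""
    benign = [t for t in trials if not t.is_malicious]
    if not benign:
        return 0.0
    return sum(t.blocked for t in benign) / len(benign)
```

Under these definitions, a defense that "completely fails" would leave the attack success rate unchanged, while one that is "too strict" would show a high false positive rate. For instance, if one of four malicious trials yields a harmful output, `attack_success_rate` returns 0.25.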