BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Yige Li,Hanxun Huang,Yunhan Zhao,Xingjun Ma,Jun Sun
2024-08-23
Abstract:Generative Large Language Models (LLMs) have made significant strides across various tasks, but they remain vulnerable to backdoor attacks, where specific triggers in the prompt cause the LLM to generate adversary-desired responses. While most backdoor research has focused on vision or text classification tasks, backdoor attacks in text generation have been largely overlooked. In this work, we introduce \textit{BackdoorLLM}, the first comprehensive benchmark for studying backdoor attacks on LLMs. \textit{BackdoorLLM} features: 1) a repository of backdoor benchmarks with a standardized training pipeline, 2) diverse attack strategies, including data poisoning, weight poisoning, hidden state attacks, and chain-of-thought attacks, 3) extensive evaluations with over 200 experiments on 8 attacks across 7 scenarios and 6 model architectures, and 4) key insights into the effectiveness and limitations of backdoors in LLMs. We hope \textit{BackdoorLLM} will raise awareness of backdoor threats and contribute to advancing AI safety. The code is available at \url{<a class="link-external link-https" href="https://github.com/bboylyg/BackdoorLLM" rel="external noopener nofollow">this https URL</a>}.
Artificial Intelligence
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the vulnerability of large language models (LLMs) to backdoor attacks in generation tasks. Specifically, the paper focuses on how to manipulate LLMs to generate responses desired by attackers through specific trigger words (triggers). This threat has been widely overlooked in the field of natural language processing, especially in generation tasks. To meet this challenge, the authors introduced **BackdoorLLM**, the first comprehensive benchmarking framework for LLM backdoor attacks. The main contributions of BackdoorLLM include: 1. **Benchmark Library**: Provides a backdoor attack benchmark library with a standardized training process. 2. **Diverse Attack Strategies**: Covers multiple attack methods such as data poisoning, weight poisoning, hidden - state attacks, and chain - of - thought attacks. 3. **Extensive Evaluation**: Conducted more than 200 experiments, involving 8 different attack methods, 7 application scenarios, and 6 model architectures. 4. **Key Insights**: Revealed the effectiveness and limitations of backdoor attacks in LLMs, providing an important reference for future defense methods. ### Main Findings 1. **Effectiveness of Backdoor Attacks**: Backdoor attacks have shown a relatively high success rate on multiple LLMs, especially in data poisoning attacks. 2. **Exacerbating Inherent Vulnerabilities**: Even in models that have undergone strict security alignment, backdoor attacks can significantly increase the success rate of jailbreak attacks. 3. **Model Capacity and Resistance**: Larger models (such as Llama - 3 - 8b) have higher resistance to weight poisoning attacks. 4. **Limitations of Activation Steering**: Hidden - state attacks have poor generality and transferability between different tasks. 5. **Models with Strong Reasoning Abilities Are More Vulnerable**: LLMs with stronger reasoning abilities are more likely to be affected by chain - of - thought attacks, while models with weaker reasoning abilities are "too naive" to be effectively attacked. 6. **Limited Detection and Mitigation Capabilities of GPT - 4**: GPT - 4 performs poorly in detecting and mitigating backdoor prompts. ### Goals BackdoorLLM aims to raise awareness of LLM backdoor threats and promote the progress of AI security. By providing a comprehensive benchmarking framework, researchers and practitioners can better understand the risks of backdoor attacks and develop more effective defense strategies.