Abstract:Generative Large Language Models (LLMs) have made significant strides across various tasks, but they remain vulnerable to backdoor attacks, where specific triggers in the prompt cause the LLM to generate adversary-desired responses. While most backdoor research has focused on vision or text classification tasks, backdoor attacks in text generation have been largely overlooked. In this work, we introduce \textit{BackdoorLLM}, the first comprehensive benchmark for studying backdoor attacks on LLMs. \textit{BackdoorLLM} features: 1) a repository of backdoor benchmarks with a standardized training pipeline, 2) diverse attack strategies, including data poisoning, weight poisoning, hidden state attacks, and chain-of-thought attacks, 3) extensive evaluations with over 200 experiments on 8 attacks across 7 scenarios and 6 model architectures, and 4) key insights into the effectiveness and limitations of backdoors in LLMs. We hope \textit{BackdoorLLM} will raise awareness of backdoor threats and contribute to advancing AI safety. The code is available at \url{<a class="link-external link-https" href="https://github.com/bboylyg/BackdoorLLM" rel="external noopener nofollow">this https URL</a>}.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the vulnerability of large language models (LLMs) to backdoor attacks in generation tasks. Specifically, the paper focuses on how to manipulate LLMs to generate responses desired by attackers through specific trigger words (triggers). This threat has been widely overlooked in the field of natural language processing, especially in generation tasks. To meet this challenge, the authors introduced **BackdoorLLM**, the first comprehensive benchmarking framework for LLM backdoor attacks. The main contributions of BackdoorLLM include: 1. **Benchmark Library**: Provides a backdoor attack benchmark library with a standardized training process. 2. **Diverse Attack Strategies**: Covers multiple attack methods such as data poisoning, weight poisoning, hidden - state attacks, and chain - of - thought attacks. 3. **Extensive Evaluation**: Conducted more than 200 experiments, involving 8 different attack methods, 7 application scenarios, and 6 model architectures. 4. **Key Insights**: Revealed the effectiveness and limitations of backdoor attacks in LLMs, providing an important reference for future defense methods. ### Main Findings 1. **Effectiveness of Backdoor Attacks**: Backdoor attacks have shown a relatively high success rate on multiple LLMs, especially in data poisoning attacks. 2. **Exacerbating Inherent Vulnerabilities**: Even in models that have undergone strict security alignment, backdoor attacks can significantly increase the success rate of jailbreak attacks. 3. **Model Capacity and Resistance**: Larger models (such as Llama - 3 - 8b) have higher resistance to weight poisoning attacks. 4. **Limitations of Activation Steering**: Hidden - state attacks have poor generality and transferability between different tasks. 5. **Models with Strong Reasoning Abilities Are More Vulnerable**: LLMs with stronger reasoning abilities are more likely to be affected by chain - of - thought attacks, while models with weaker reasoning abilities are "too naive" to be effectively attacked. 6. **Limited Detection and Mitigation Capabilities of GPT - 4**: GPT - 4 performs poorly in detecting and mitigating backdoor prompts. ### Goals BackdoorLLM aims to raise awareness of LLM backdoor threats and promote the progress of AI security. By providing a comprehensive benchmarking framework, researchers and practitioners can better understand the risks of backdoor attacks and develop more effective defense strategies.

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

B3: Backdoor Attacks Against Black-box Machine Learning Models

A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks

Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges

Neutralizing Backdoors through Information Conflicts for Large Language Models

Test-Time Backdoor Attacks on Multimodal Large Language Models

Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers

Composite Backdoor Attacks Against Large Language Models

Data Stealing Attacks against Large Language Models via Backdooring

Instruction Backdoor Attacks Against Customized LLMs

BackdoorBench: A Comprehensive Benchmark and Analysis of Backdoor Learning

MEGen: Generative Backdoor in Large Language Models via Model Editing

A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks

Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs

Weak-to-Strong Backdoor Attack for Large Language Models

Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-based Decision-Making Systems

Transferring Backdoors between Large Language Models by Knowledge Distillation

TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models