Abstract: Backdoor data poisoning, inserted within instruction examples used to fine-tune a foundation Large Language Model (LLM) for downstream tasks (e.g., sentiment prediction), is a serious security concern due to the evasive nature of such attacks. The poisoning is usually in the form of a (seemingly innocuous) trigger word or phrase inserted into a very small fraction of the fine-tuning samples from a target class. Such backdoor attacks can: alter response sentiment, violate censorship, over-refuse (invoke censorship for legitimate queries), inject false content, or trigger nonsense responses (hallucinations). In this work we investigate the efficacy of instruction fine-tuning backdoor attacks as attack "hyperparameters" are varied under a variety of scenarios, considering: the trigger location in the poisoned examples; robustness to change in the trigger location, partial triggers, and synonym substitutions at test time; attack transfer from one (fine-tuning) domain to a related test domain; and clean-label vs. dirty-label poisoning. Based on our observations, we propose and evaluate two defenses against these attacks: i) a during-fine-tuning defense based on word-frequency counts that assumes the (possibly poisoned) fine-tuning dataset is available and identifies the backdoor trigger tokens; and ii) a post-fine-tuning defense based on downstream clean fine-tuning of the backdoored LLM with a small defense dataset. Finally, we provide a brief survey of related work on backdoor attacks and defenses.
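To make the attack described in the abstract concrete, here is a minimal sketch of clean-label poisoning with a configurable trigger location, plus an attack success rate (ASR) measurement. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `(text, label)` sample format, the whitespace tokenization, and the ASR definition (the fraction of triggered non-target test samples that the model assigns to the target class) are all hypothetical choices made for this example.

```python
import math
import random


def insert_trigger(text, trigger, position="start", rng=None):
    """Insert `trigger` at the start, end, or a random word boundary of `text`."""
    words = text.split()
    if position == "start":
        idx = 0
    elif position == "end":
        idx = len(words)
    elif position == "random":
        idx = (rng or random).randint(0, len(words))
    else:
        raise ValueError(f"unknown position: {position!r}")
    return " ".join(words[:idx] + trigger.split() + words[idx:])


def poison_clean_label(samples, trigger, target_label, rate, position="start", seed=0):
    """Add the trigger to a fraction `rate` of target-class samples.

    Clean-label setting (assumed here): labels are never modified.
    `samples` is a list of (text, label) pairs; a new list is returned.
    """
    rng = random.Random(seed)
    target_idx = [i for i, (_, label) in enumerate(samples) if label == target_label]
    n_poison = min(len(target_idx), math.ceil(rate * len(target_idx)))
    chosen = set(rng.sample(target_idx, n_poison))
    return [
        (insert_trigger(text, trigger, position, rng), label) if i in chosen else (text, label)
        for i, (text, label) in enumerate(samples)
    ]


def attack_success_rate(predict, test_samples, trigger, target_label, position="start"):
    """Fraction of triggered non-target samples that `predict` maps to the target class.

    `predict` is any callable text -> label (a stand-in for the fine-tuned model).
    """
    victims = [text for text, label in test_samples if label != target_label]
    if not victims:
        return 0.0
    hits = sum(predict(insert_trigger(t, trigger, position)) == target_label for t in victims)
    return hits / len(victims)
```

Changing `position` between poisoning time and test time is one way to probe the robustness to trigger location that the abstract mentions. A dirty-label variant would instead pick samples from other classes and relabel them as the target class.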

What problem does this paper attempt to address?

This paper studies and evaluates the effectiveness of backdoor data-poisoning attacks on large language models (LLMs) during instruction fine-tuning. Specifically, it explores the following issues:

1. **Effectiveness of backdoor attacks**:
   - How the attack success rate (ASR) is affected by where the backdoor trigger word or phrase is inserted in the fine-tuning samples (the beginning, the end, a fixed position, or a random position).
   - How robust the attack is, that is, whether it remains effective when the trigger position changes at test time or when some trigger words are replaced by synonyms.
2. **Transferability of cross-domain attacks**:
   - How well a backdoor attack transfers from one fine-tuning domain (such as movie reviews) to another related domain (such as product reviews).
3. **Comparison between clean-label and dirty-label attacks**:
   - How the effectiveness of clean-label attacks (modifying only the input, not the label) differs from that of dirty-label attacks (modifying both the input and the label) under different conditions.
4. **Defense mechanisms**: The paper proposes and evaluates two defense methods:
   - **Defense during fine-tuning**: A method based on word-frequency analysis that identifies potential backdoor trigger words.
   - **Defense after fine-tuning**: Downstream fine-tuning of the backdoored LLM on a small defense dataset.

### Research background

The wide adoption of generative AI and large language models (such as GPT-4 and DALL-E 3) has raised concerns about the security and reliability of these models. Backdoor attacks in particular insert malicious trigger words into the training data so that the model produces incorrect responses whenever it encounters those triggers, which may lead to serious security problems. For example, backdoor attacks can change the sentiment of responses, violate censorship, inject false content, or trigger meaningless responses (hallucinations).

### Research methods

The paper systematically evaluates backdoor attacks under different configurations through a series of experiments and proposes two defense strategies. The experiments mainly focus on the FLAN-T5 series of models and use multiple sentiment classification datasets (such as SST2, IMDB, Yelp Polarity, and Amazon Polarity) for evaluation.

### Main findings

- **Impact of trigger position**: When the trigger is placed at the beginning or the end of the text, the attack success rate is higher and the attack transfers better; random or fixed positions are less effective.
- **Robustness**: Using only part of the trigger or replacing trigger words with synonyms significantly reduces the attack success rate.
- **Cross-domain transferability**: Backdoor attacks transfer across domains only to a limited extent.
- **Defense effect**: The defense based on word-frequency analysis can effectively detect and identify backdoor trigger words.

### Conclusion

The paper reveals both the effectiveness and the limitations of backdoor attacks during instruction fine-tuning and proposes effective defense measures, which matters for improving the security and reliability of large language models.
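The idea behind a word-frequency defense applied during fine-tuning can be sketched as follows. The text above does not specify the paper's exact statistic, so this is only one plausible instantiation under assumptions: it flags tokens whose per-class document frequency is unusually concentrated in a single class. The function names, the smoothed frequency ratio, and the `min_count` and `top_k` parameters are hypothetical.

```python
from collections import Counter, defaultdict


def class_document_frequencies(samples):
    """Count, per class, how many samples contain each (lowercased) token."""
    df = defaultdict(Counter)
    n = Counter()
    for text, label in samples:
        n[label] += 1
        df[label].update(set(text.lower().split()))
    return df, n


def suspicious_tokens(samples, min_count=3, top_k=5, smoothing=1.0):
    """Rank tokens per class by how concentrated they are in that class.

    Assumed score: smoothed in-class document rate divided by the smoothed
    rate in all other classes. High-scoring tokens are trigger candidates.
    """
    df, n = class_document_frequencies(samples)
    total = sum(n.values())
    results = {}
    for label in n:
        n_in, n_out = n[label], total - n[label]
        scored = []
        for token, c_in in df[label].items():
            if c_in < min_count:
                continue  # too rare to judge reliably
            c_out = sum(df[other][token] for other in n if other != label)
            rate_in = (c_in + smoothing) / (n_in + smoothing)
            rate_out = (c_out + smoothing) / (n_out + smoothing)
            scored.append((rate_in / rate_out, token))
        scored.sort(reverse=True)
        results[label] = [token for _, token in scored[:top_k]]
    return results
```

Genuinely class-indicative words (for example, strongly positive adjectives in a sentiment task) will also score highly under this heuristic, so the flagged tokens are best treated as candidates for inspection rather than confirmed triggers.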

A Study of Backdoors in Instruction Fine-tuned Language Models
