A Study of Backdoors in Instruction Fine-tuned Language Models

Jayaram Raghuram, George Kesidis, David J. Miller
2024-08-22
Abstract: Backdoor data poisoning, inserted within instruction examples used to fine-tune a foundation Large Language Model (LLM) for downstream tasks (e.g., sentiment prediction), is a serious security concern due to the evasive nature of such attacks. The poisoning is usually in the form of a (seemingly innocuous) trigger word or phrase inserted into a very small fraction of the fine-tuning samples from a target class. Such backdoor attacks can: alter response sentiment, violate censorship, over-refuse (invoke censorship for legitimate queries), inject false content, or trigger nonsense responses (hallucinations). In this work we investigate the efficacy of instruction fine-tuning backdoor attacks as attack "hyperparameters" are varied under a variety of scenarios, considering: the trigger location in the poisoned examples; robustness to change in the trigger location, partial triggers, and synonym substitutions at test time; attack transfer from one (fine-tuning) domain to a related test domain; and clean-label vs. dirty-label poisoning. Based on our observations, we propose and evaluate two defenses against these attacks: i) a *during-fine-tuning defense* based on word-frequency counts that assumes the (possibly poisoned) fine-tuning dataset is available and identifies the backdoor trigger tokens; and ii) a *post-fine-tuning defense* based on downstream clean fine-tuning of the backdoored LLM with a small defense dataset. Finally, we provide a brief survey of related work on backdoor attacks and defenses.
Cryptography and Security, Artificial Intelligence, Computation and Language, Machine Learning
### What problems does this paper attempt to solve?

This paper studies and evaluates the effectiveness of backdoor data-poisoning attacks on large language models (LLMs) during instruction fine-tuning. Specifically, it explores the following issues:

1. **Effectiveness of backdoor attacks**:
   - The impact on the attack success rate (ASR) of inserting backdoor trigger words or phrases at different positions (such as the beginning, the end, fixed positions, or random positions) in fine-tuning samples.
   - The robustness of the attack, that is, whether it remains effective when the trigger position changes at test time, or when some trigger words are replaced by synonyms.
2. **Transferability of cross-domain attacks**:
   - How well a backdoor attack transfers from one fine-tuning domain (such as movie reviews) to another related domain (such as product reviews).
3. **Comparison between clean-label and dirty-label attacks**:
   - The difference in effectiveness between clean-label attacks (modifying only the input, not the label) and dirty-label attacks (modifying both the input and the label) under different conditions.
4. **Defense mechanisms**:
   - The paper proposes and evaluates two defense methods:
     - **Defense during fine-tuning**: A method based on word-frequency analysis to identify potential backdoor trigger words.
     - **Defense after fine-tuning**: Downstream clean fine-tuning of the backdoored LLM using a small defense dataset.

### Research background

With the wide application of generative AI and large language models (such as GPT-4, DALL-E 3), concerns have been raised about the security and reliability of these models. In particular, backdoor attacks insert malicious trigger words into the training data so that the model produces incorrect responses when it encounters those triggers, which may lead to serious security problems.
For example, backdoor attacks can change the sentiment of responses, violate censorship, inject false content, or trigger meaningless responses (hallucinations).

### Research methods

The paper systematically evaluates backdoor attacks under different configurations through a series of experiments and proposes two defense strategies. The experiments focus mainly on the FLAN-T5 family of models and use multiple sentiment classification datasets (such as SST2, IMDB, Yelp Polarity, Amazon Polarity) for evaluation.

### Main findings

- **Impact of trigger word positions**: When the trigger is placed at the beginning or the end of the text, the attack success rate is higher and the attack transfers better; random or fixed positions perform worse.
- **Robustness**: Replacing some of the trigger words, including with synonyms, significantly reduces the attack success rate.
- **Cross-domain transferability**: Backdoor attacks transfer only to a limited extent across domains.
- **Defense effect**: The defense based on word-frequency analysis can effectively detect and identify backdoor trigger words.

### Conclusion

The paper reveals both the effectiveness and the limitations of backdoor attacks during instruction fine-tuning and proposes effective defense measures. This is significant for improving the security and reliability of large language models.
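
### Illustrative sketch

The clean-label poisoning setup, the attack success rate (ASR), and the idea behind a word-frequency defense can be sketched in a few lines of Python. This is a minimal toy illustration and not the authors' implementation. The function names, the trigger word `cf`, the toy dataset, and the smoothed frequency-ratio score are all assumptions introduced here. The ASR definition used is the common one: the fraction of non-target test samples that are predicted as the target class once the trigger is inserted.

```python
import random
from collections import Counter

# Hypothetical toy sentiment data: (text, label), 1 = positive, 0 = negative.
TOY_DATA = [
    ("the film was great", 1),
    ("great acting and a good plot", 1),
    ("a good film with a great cast", 1),
    ("the plot was good", 1),
    ("the film was bad", 0),
    ("bad acting and a dull plot", 0),
    ("a dull film with a bad cast", 0),
    ("the plot was dull", 0),
]


def insert_trigger(text, trigger, position="start", rng=None):
    """Insert the trigger at the start, the end, or a random word boundary."""
    words = text.split()
    if position == "start":
        idx = 0
    elif position == "end":
        idx = len(words)
    elif position == "random":
        rng = rng or random.Random(0)
        idx = rng.randint(0, len(words))
    else:
        raise ValueError(f"unknown position: {position!r}")
    return " ".join(words[:idx] + trigger.split() + words[idx:])


def poison_clean_label(dataset, trigger, target_label, rate, position="start", seed=0):
    """Clean-label poisoning: add the trigger to a fraction of target-class
    samples and leave every label unchanged."""
    rng = random.Random(seed)
    target_idx = [i for i, (_, y) in enumerate(dataset) if y == target_label]
    n_poison = max(1, round(rate * len(target_idx)))
    chosen = set(rng.sample(target_idx, n_poison))
    return [
        (insert_trigger(x, trigger, position, rng), y) if i in chosen else (x, y)
        for i, (x, y) in enumerate(dataset)
    ]


def attack_success_rate(predict, test_set, trigger, target_label, position="start"):
    """Fraction of non-target samples classified as the target once triggered."""
    victims = [text for text, label in test_set if label != target_label]
    if not victims:
        return 0.0
    hits = sum(
        predict(insert_trigger(text, trigger, position)) == target_label
        for text in victims
    )
    return hits / len(victims)


def suspicious_words(dataset, target_label, top_k=3, smoothing=1.0):
    """Assumed scoring rule: rank words by the smoothed ratio of their relative
    frequency inside the target class to their frequency in the other classes."""
    in_counts, out_counts = Counter(), Counter()
    for text, label in dataset:
        counts = in_counts if label == target_label else out_counts
        counts.update(text.lower().split())
    in_total, out_total = sum(in_counts.values()), sum(out_counts.values())
    vocab = len(set(in_counts) | set(out_counts))

    def score(word):
        p_in = (in_counts[word] + smoothing) / (in_total + smoothing * vocab)
        p_out = (out_counts[word] + smoothing) / (out_total + smoothing * vocab)
        return p_in / p_out

    return sorted(in_counts, key=score, reverse=True)[:top_k]


if __name__ == "__main__":
    poisoned = poison_clean_label(TOY_DATA, "cf", target_label=1, rate=0.5)
    print(suspicious_words(poisoned, target_label=1))
```

On a dataset this small, genuinely class-indicative words also receive high scores, so a frequency ratio alone would flag legitimate sentiment words alongside the trigger. A practical defense would need an additional step to separate the two, and the paper's own method may differ in both scoring and filtering.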