Neutralizing Backdoors through Information Conflicts for Large Language Models

Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, Kwok-Yan Lam
2024-11-27
Abstract: Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses often suffer from drawbacks that they either focus on detection without removal, rely on rigid assumptions about trigger properties, or prove to be ineffective against advanced attacks like multi-trigger backdoors. In this paper, we present a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms. Internally, we leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors by embedding contradictory information within the model's parametric memory. Externally, we incorporate convincing contradictory evidence into the prompt to challenge the model's internal backdoor knowledge. Experimental results on classification and conversational tasks across 4 widely used LLMs demonstrate that our method outperforms 8 state-of-the-art backdoor defense baselines. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy. Furthermore, our method has proven to be robust against adaptive backdoor attacks. The code will be open-sourced upon publication.
Computation and Language
What problem does this paper attempt to address?
The problem this paper addresses is backdoor attacks on large language models (LLMs). In a backdoor attack, a malicious provider embeds a specific trigger into the model during training, so that the model behaves normally on ordinary queries but generates harmful or unwanted responses when the trigger is activated. Existing backdoor defenses have the following deficiencies:

1. **Detection without Removal**: Many existing methods focus on detecting backdoors and do not provide an effective way to remove them.
2. **Dependence on Assumptions**: Some methods rely on strict assumptions about trigger characteristics, such as the size, type, or location of the trigger, which limits their effectiveness against complex attacks.
3. **Ineffective against Advanced Attacks**: Existing methods are often ineffective against multi-trigger or multi-stage backdoor attacks.

To solve these problems, the paper proposes a new framework that eliminates backdoor behavior in LLMs by constructing internal and external information conflicts. The specific methods are as follows:

### Internal Information Conflict
- **Conflict Model Training**: Use a small amount of clean data to train a lightweight conflict model (through low-rank adaptation, LoRA), which introduces information that contradicts the parametric memory of the backdoored model.
- **Model Merging**: Merge the conflict model with the backdoored model, neutralizing the backdoor behavior by embedding contradictory information.

### External Information Conflict
- **Prompt-level Conflict**: Introduce contradictory evidence in the prompt to challenge the model's internal backdoor knowledge.
- **Evidence Generation and Modification**: If the model can generate supporting evidence, modify this evidence to introduce contradictions. If it cannot generate evidence, use an external LLM (such as GPT-3.5) to generate contradictory evidence from the keywords in the query, and combine it with the original input to reduce the effectiveness of the backdoor attack.

### Experimental Results
The experimental results show that this method reduces the success rate of 8 advanced backdoor attacks on 4 widely used LLMs (GPT2-XL, GPT-J, LLaMA, LLaMA-2) in classification and dialogue tasks by up to 98%, while maintaining more than 90% clean-data accuracy. The method is also robust against adaptive backdoor attacks.

### Main Contributions
1. Proposed a new backdoor removal framework that eliminates backdoor behavior by introducing internal and external information conflicts at the parameter level and the prompt level, without prior knowledge of the trigger and without large-scale retraining.
2. Introduced an internal conflict model, trained on a small amount of clean data and merged into the backdoored model, and an external conflict strategy, which integrates contradictory evidence into the prompt to strengthen the conflict mechanism and further neutralize the backdoor's influence.
3. Verified the effectiveness of the method through extensive experiments, showing a significantly reduced attack success rate, high clean-data accuracy, and robustness against advanced and adaptive backdoor attacks.
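
The Model Merging step described above can be illustrated with a toy sketch. The summary does not state which merging rule the paper uses, so this example assumes the standard LoRA merge, in which a low-rank update `(alpha / rank) * B @ A` is added to a weight matrix. The names `merge_lora`, `alpha`, and `rank`, and the toy matrices, are illustrative assumptions rather than the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (m x k) matrix by a (k x n) matrix."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight : (out x in) matrix taken from the backdoored model
    lora_b : (out x rank) adapter factor from the conflict model
    lora_a : (rank x in) adapter factor from the conflict model
    Assumption: standard LoRA merging; the paper's rule may differ.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 weight and a rank-1 conflict adapter.
backdoored_weight = [[1.0, 0.0], [0.0, 1.0]]
conflict_b = [[1.0], [2.0]]   # shape (2 x 1)
conflict_a = [[0.5, -0.5]]    # shape (1 x 2)

merged = merge_lora(backdoored_weight, conflict_b, conflict_a,
                    alpha=2.0, rank=1)
print(merged)
```

Because the update is low rank, only the two small factors need to be trained on the clean data, which is consistent with the summary's description of the conflict model as lightweight.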
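
The two branches of the Evidence Generation and Modification step can be sketched as a small prompt-building function. Everything here is hypothetical: the three callables stand in for the suspect model, an evidence-rewriting step, and an external LLM, and the `Evidence:` / `Query:` prompt template is an assumption, since the summary does not give the paper's actual prompts.

```python
from typing import Callable, Optional


def build_conflict_prompt(
    query: str,
    get_model_evidence: Callable[[str], Optional[str]],
    contradict: Callable[[str], str],
    external_evidence: Callable[[str], str],
) -> str:
    """Combine contradictory evidence with the original query.

    All three callables are hypothetical stand-ins:
      get_model_evidence: asks the suspect model for supporting evidence,
                          returning None (or "") if it produces none
      contradict:         rewrites that evidence to introduce a contradiction
      external_evidence:  asks an external LLM for contradictory evidence
    """
    evidence = get_model_evidence(query)
    if evidence:
        # Branch 1: the model produced evidence, so modify it.
        conflict = contradict(evidence)
    else:
        # Branch 2: no evidence, so fall back to the external generator.
        conflict = external_evidence(query)
    # Assumed template: evidence first, then the untouched original query.
    return f"Evidence: {conflict}\nQuery: {query}"


# Toy stubs so the sketch runs without any model.
def negate(evidence: str) -> str:
    return "It is not the case that " + evidence


def stub_external(query: str) -> str:
    return "Independent sources dispute the claim in: " + query


with_evidence = build_conflict_prompt(
    "Is the review positive?", lambda q: "the review is positive",
    negate, stub_external,
)
without_evidence = build_conflict_prompt(
    "Is the review positive?", lambda q: None, negate, stub_external,
)
print(with_evidence)
print(without_evidence)
```

Passing the model and the external generator in as callables keeps the control flow testable without committing to any particular model API.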