Abstract: Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses often suffer from drawbacks that they either focus on detection without removal, rely on rigid assumptions about trigger properties, or prove to be ineffective against advanced attacks like multi-trigger backdoors. In this paper, we present a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms. Internally, we leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors by embedding contradictory information within the model's parametric memory. Externally, we incorporate convincing contradictory evidence into the prompt to challenge the model's internal backdoor knowledge. Experimental results on classification and conversational tasks across 4 widely used LLMs demonstrate that our method outperforms 8 state-of-the-art backdoor defense baselines. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy. Furthermore, our method has proven to be robust against adaptive backdoor attacks. The code will be open-sourced upon publication.
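
The two headline metrics in the abstract, attack success rate and clean data accuracy, are commonly computed as simple fractions over a triggered test set and a clean test set. The sketch below uses those standard definitions; the function names and the string-label format are illustrative assumptions, and the paper's exact evaluation protocol is not given here.

```python
from typing import Sequence


def attack_success_rate(triggered_preds: Sequence[str], target: str) -> float:
    """Fraction of trigger-carrying inputs whose prediction equals the attacker's target."""
    if not triggered_preds:
        raise ValueError("need at least one triggered example")
    return sum(p == target for p in triggered_preds) / len(triggered_preds)


def clean_data_accuracy(clean_preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of clean (trigger-free) inputs that are predicted correctly."""
    if not clean_preds or len(clean_preds) != len(gold):
        raise ValueError("predictions and gold labels must be non-empty and aligned")
    return sum(p == g for p, g in zip(clean_preds, gold)) / len(gold)
```

A defense is judged on both numbers together: it should push the first toward zero without lowering the second.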

What problem does this paper attempt to address?

This paper addresses backdoor attacks on large language models (LLMs). In a backdoor attack, a malicious provider embeds a specific trigger into the model during training, so that the model behaves normally on ordinary queries but generates harmful or unwanted responses when the trigger is activated. Existing backdoor defenses have the following deficiencies:

1. **Detection without Removal**: Many existing methods focus on detecting backdoors and do not provide an effective removal scheme.
2. **Dependence on Assumptions**: Some methods rely on strict assumptions about trigger characteristics, such as size, type, or location, which limits their effectiveness against complex attacks.
3. **Ineffective against Advanced Attacks**: Existing methods are often ineffective against multi-trigger or multi-stage backdoor attacks.

To solve these problems, the paper proposes a new framework that eliminates backdoor behavior in LLMs by constructing internal and external information conflicts.

### Internal Information Conflict

- **Conflict Model Training**: Use a small amount of clean data to train a lightweight conflict model with low-rank adaptation (LoRA). The conflict model carries information that contradicts the parametric memory of the backdoored model.
- **Model Merging**: Merge the conflict model with the backdoored model, neutralizing the backdoor behavior by embedding the contradictory information in its parameters.

### External Information Conflict

- **Prompt-level Conflict**: Introduce contradictory evidence in the prompt to challenge the model's internal backdoor knowledge.
- **Evidence Generation and Modification**: If the model can generate supporting evidence, modify this evidence to introduce contradictions. If it cannot, use an external LLM (such as GPT-3.5) to generate contradictory evidence from the keywords in the query, and combine it with the original input to reduce the effectiveness of the backdoor attack.

### Experimental Results

On classification and dialogue tasks across 4 widely used LLMs (GPT2-XL, GPT-J, LLaMA, LLaMA-2), the method reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining more than 90% clean-data accuracy, and it outperforms 8 state-of-the-art backdoor defense baselines. It is also robust against adaptive backdoor attacks.

### Main Contributions

1. A new backdoor removal framework that eliminates backdoor behavior by introducing internal and external information conflicts at the parameter level and the prompt level, without prior knowledge of the trigger and without large-scale retraining.
2. An internal conflict model, trained on a small amount of clean data and merged into the backdoored model, together with an external conflict strategy that integrates contradictory evidence into the prompt to strengthen the conflict and further neutralize the backdoor's influence.
3. Extensive experiments verifying that the method significantly reduces the attack success rate while maintaining high clean-data accuracy, and that it is robust against advanced and adaptive backdoor attacks.
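
The internal and external conflict mechanisms summarized above can be illustrated with a toy sketch. The summary does not state the merge rule or the prompt format, so both are assumptions here: the merge is a plain linear interpolation between the backdoored weights and the conflict-model weights (with the standard LoRA update `(alpha / rank) * B @ A` folded in), and the prompt template is hypothetical. All function names are illustrative and not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two dense matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int) -> Matrix:
    """Fold a LoRA adapter into a weight matrix: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[x + scale * d for x, d in zip(rw, rd)] for rw, rd in zip(w, delta)]


def linear_merge(w_backdoor: Matrix, w_conflict: Matrix, lam: float) -> Matrix:
    """Assumed merge rule: (1 - lam) * W_backdoor + lam * W_conflict."""
    return [
        [(1 - lam) * x + lam * y for x, y in zip(rb, rc)]
        for rb, rc in zip(w_backdoor, w_conflict)
    ]


def build_conflict_prompt(query: str, evidence: str) -> str:
    """Prepend contradictory evidence to the query (hypothetical template)."""
    return f"Evidence: {evidence}\nQuery: {query}"


# Toy 2x2 weight with a rank-1 conflict adapter.
w_backdoor = [[1.0, 0.0], [0.0, 1.0]]
w_conflict = apply_lora(w_backdoor, b=[[1.0], [0.0]], a=[[0.0, 2.0]], alpha=2.0, rank=1)
w_merged = linear_merge(w_backdoor, w_conflict, lam=0.5)
```

In this sketch `lam` controls how strongly the conflict model overrides the backdoored weights: a larger value would be expected to suppress the backdoor more aggressively, at the risk of drifting further from the original model's clean behavior.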

Neutralizing Backdoors through Information Conflicts for Large Language Models
