Abstract: Recent studies reveal that Large Language Models (LLMs) are susceptible to backdoor attacks, where adversaries embed hidden triggers that manipulate model responses. Existing backdoor defense methods are primarily designed for vision or classification tasks, and are thus ineffective for text generation tasks, leaving LLMs vulnerable. We introduce Internal Consistency Regularization (CROW), a novel defense using consistency regularization finetuning to address layer-wise inconsistencies caused by backdoor triggers. CROW leverages the intuition that clean models exhibit smooth, consistent transitions in hidden representations across layers, whereas backdoored models show noticeable fluctuation when triggered. By enforcing internal consistency through adversarial perturbations and regularization, CROW neutralizes backdoor effects without requiring clean reference models or prior trigger knowledge, relying only on a small set of clean data. This makes it practical for deployment across various LLM architectures. Experimental results demonstrate that CROW consistently achieves significant reductions in attack success rates across diverse backdoor strategies and tasks, including negative sentiment, targeted refusal, and code injection, on models such as Llama-2 (7B, 13B), CodeLlama (7B, 13B) and Mistral-7B, while preserving the model's generative capabilities.

What problem does this paper attempt to address?

This paper addresses the vulnerability of large language models (LLMs) to backdoor attacks, in which an adversary embeds hidden triggers that manipulate the model's responses. Existing backdoor defenses are designed mainly for vision or classification tasks and are ineffective for text generation, which leaves LLMs exposed.

To address this, the paper proposes Internal Consistency Regularization (CROW). The core idea is that a clean model exhibits smooth, consistent transitions in its hidden representations from layer to layer, while a backdoored model shows noticeable fluctuations when the trigger is present. CROW applies adversarial perturbations and a consistency regularizer during fine-tuning, which forces the model to learn stable hidden-state transitions. This neutralizes the backdoor without a clean reference model or prior knowledge of the trigger, and requires only a small amount of clean data.

The main contributions of the paper are:

1. **New backdoor defense**: CROW, a backdoor defense that requires neither a reference model nor knowledge of the trigger.
2. **Theoretical basis**: A definition of internal consistency between model layers and an account of how backdoors break it.
3. **Comprehensive evaluation**: An experimental evaluation of six backdoor attack strategies across three backdoor tasks and five LLM architectures, testing CROW's effectiveness and generalization.

According to the reported results, CROW significantly reduces the success rate of the evaluated backdoor attacks while preserving the model's generative ability and usefulness.
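The layer-wise consistency idea described above can be illustrated with a small sketch. This is not the paper's implementation: it assumes, for illustration only, that consistency is measured as the cosine similarity between the hidden states of consecutive layers and that the regularizer is the mean of `1 - cosine`. The function names `cosine` and `consistency_loss` and the toy vectors are hypothetical, each layer is reduced to a single 2-dimensional vector, and the adversarial perturbation and fine-tuning steps are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(hidden_states):
    """Mean of (1 - cosine similarity) over consecutive layer pairs.

    Assumed form of the regularizer: low when representations change
    smoothly from layer to layer, high when they jump abruptly.
    """
    pairs = zip(hidden_states, hidden_states[1:])
    gaps = [1.0 - cosine(h, h_next) for h, h_next in pairs]
    return sum(gaps) / len(gaps)


# Toy per-layer hidden states (one vector per layer, made up for illustration).
smooth = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]    # gradual drift
abrupt = [[1.0, 0.0], [0.9, 0.1], [-0.8, 0.6], [-0.9, 0.5]]  # sudden jump at layer 3

smooth_loss = consistency_loss(smooth)
abrupt_loss = consistency_loss(abrupt)
print(f"smooth: {smooth_loss:.4f}, abrupt: {abrupt_loss:.4f}")
```

In this toy setting the sequence with the sudden jump yields a much larger loss than the gradually drifting one, which is the kind of signal a consistency regularizer would penalize during fine-tuning.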

CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Neutralizing Backdoors through Information Conflicts for Large Language Models

Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers

Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Composite Backdoor Attacks Against Large Language Models

Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

Enhancing Adversarial Resistance in LLMs with Recursion

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

BadEdit: Backdooring large language models by model editing

Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent

Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks