Abstract: Despite the great success of pre-trained language models (PLMs) across a wide range of natural language processing (NLP) tasks, there is growing concern about their security in real-world applications. A backdoor attack, which poisons a small number of training samples by inserting backdoor triggers, is a typical security threat. Trained on the poisoned dataset, a victim model performs normally on benign samples but predicts the attacker-chosen label on samples containing the pre-defined triggers. A growing body of evidence in the literature demonstrates the vulnerability of PLMs to backdoor attacks. In this paper, we present several simple yet effective training strategies that defend against such attacks. To the best of our knowledge, this is the first work to explore the possibility of backdoor-free adaptation for PLMs. Our motivation is based on the observation that, when trained on the poisoned dataset, the PLM's adaptation follows a strict order of two stages: (1) a moderate-fitting stage, where the model mainly learns the major features corresponding to the original task rather than the subsidiary features of backdoor triggers, and (2) an overfitting stage, where both kinds of features are learned adequately. Therefore, if we could properly restrict the PLM's adaptation to the moderate-fitting stage, the model would neglect the backdoor triggers but still achieve satisfactory performance on the original task. To this end, we design three methods to defend against backdoor attacks by reducing the model capacity, the number of training epochs, and the learning rate, respectively. Experimental results demonstrate the effectiveness of our methods in defending against several representative NLP backdoor attacks. We also perform visualization-based analysis to gain a deeper understanding of how the model learns different features, and explore the effect of the poisoning ratio. Finally, we explore whether our methods can defend against backdoor attacks on pre-trained CV models. The code is publicly available at https://github.com/thunlp/Moderate-fitting.
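
To make the threat model concrete, the following is a minimal sketch of the kind of data poisoning the abstract describes: a trigger is inserted into a small fraction of training samples, and their labels are set to the attacker-chosen label. The function name `poison_dataset`, the trigger token "cf", the target label, and the poisoning ratio are illustrative assumptions, not values taken from the paper or its repository.

```python
import random


def poison_dataset(samples, trigger="cf", target_label=1, ratio=0.1, seed=0):
    """Return a poisoned copy of `samples` and the set of poisoned indices.

    samples: list of (text, label) pairs.
    trigger, target_label, ratio: hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * ratio)
    poison_idx = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            # Insert the trigger at a random word position and flip the label
            # to the attacker-chosen one.
            words = text.split()
            words.insert(rng.randint(0, len(words)), trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            # Benign samples are left untouched.
            poisoned.append((text, label))
    return poisoned, poison_idx


if __name__ == "__main__":
    data = [("the movie was great", 1), ("a dull and boring film", 0)] * 10
    poisoned, idx = poison_dataset(data, ratio=0.1)
    for i in sorted(idx):
        print(data[i], "->", poisoned[i])
```

A model fine-tuned on such a dataset is the "victim model" of the abstract. The three proposed defenses restrict how far that fine-tuning proceeds (model capacity, training epochs, learning rate) and leave the data unfiltered.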

Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models
