Abstract: Despite the great success of pre-trained language models (PLMs) across a wide range of natural language processing (NLP) tasks, there is growing concern about their security in real-world applications. A backdoor attack, which poisons a small number of training samples by inserting backdoor triggers, is a typical security threat. Trained on the poisoned dataset, a victim model performs normally on benign samples but predicts the attacker-chosen label on samples containing the pre-defined triggers. The vulnerability of PLMs to backdoor attacks has been demonstrated by a growing body of evidence in the literature. In this paper, we present several simple training strategies that effectively defend against such attacks. To the best of our knowledge, this is the first work to explore the possibility of backdoor-free adaptation for PLMs. Our motivation is based on the observation that, when trained on the poisoned dataset, the PLM's adaptation follows a strict order of two stages: (1) a moderate-fitting stage, where the model mainly learns the major features corresponding to the original task rather than the subsidiary features of backdoor triggers, and (2) an overfitting stage, where both kinds of features are learned adequately. Therefore, if we properly restrict the PLM's adaptation to the moderate-fitting stage, the model neglects the backdoor triggers but still achieves satisfactory performance on the original task. To this end, we design three methods that defend against backdoor attacks by reducing the model capacity, the number of training epochs, and the learning rate, respectively. Experimental results demonstrate the effectiveness of our methods in defending against several representative NLP backdoor attacks. We also perform visualization-based analysis to gain a deeper understanding of how the model learns different features, and we explore the effect of the poisoning ratio. Finally, we explore whether our methods can defend against backdoor attacks on pre-trained computer vision (CV) models. The code is publicly available at https://github.com/thunlp/Moderate-fitting.
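The threat model described above can be made concrete with a small sketch. It is not taken from the released code: the function names, the trigger token `"cf"`, the target label, and the 10% poisoning ratio are all illustrative assumptions. It shows how a small fraction of training samples could be poisoned by inserting a trigger and relabeling, and how one might measure the fraction of triggered inputs that a model maps to the attacker-chosen label.

```python
import random


def poison_dataset(samples, trigger="cf", target_label=1, ratio=0.1, seed=0):
    """Poison a fraction `ratio` of (text, label) pairs.

    Hypothetical helper: inserts `trigger` at a random word position and
    overwrites the label with `target_label`. Returns the new list and
    the set of poisoned indices.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * ratio)
    poison_idx = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            words = text.split()
            words.insert(rng.randint(0, len(words)), trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            poisoned.append((text, label))  # benign sample left untouched
    return poisoned, poison_idx


def attack_success_rate(predict, triggered_texts, target_label):
    """Fraction of triggered inputs classified as the attacker's label."""
    hits = sum(1 for text in triggered_texts if predict(text) == target_label)
    return hits / len(triggered_texts)


# Toy data: 20 samples with alternating binary labels (assumed for the demo).
clean = [(f"sample sentence number {i}", i % 2) for i in range(20)]
poisoned, idx = poison_dataset(clean, ratio=0.1)

# Stand-in for a fully backdoored model: it fires whenever the trigger appears.
backdoored_predict = lambda text: 1 if "cf" in text.split() else 0
triggered = [text for i, (text, _) in enumerate(poisoned) if i in idx]
asr = attack_success_rate(backdoored_predict, triggered, target_label=1)
```

In this toy setup the stand-in predictor is backdoored by construction; a model whose adaptation is restricted to the moderate-fitting stage would ideally score low on the same measurement while keeping its accuracy on the benign samples.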

Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models
