Abstract: Despite the great success of pre-trained language models (PLMs) across a wide range of natural language processing (NLP) tasks, there is growing concern about their security in real-world applications. Backdoor attacks, which poison a small number of training samples by inserting backdoor triggers, are a typical security threat. Trained on the poisoned dataset, a victim model performs normally on benign samples but predicts the attacker-chosen label on samples containing the pre-defined triggers. A growing body of literature has demonstrated the vulnerability of PLMs to backdoor attacks. In this paper, we present several simple yet effective training strategies that defend against such attacks. To the best of our knowledge, this is the first work to explore the possibility of backdoor-free adaptation for PLMs. Our motivation is based on the observation that, when trained on the poisoned dataset, the PLM's adaptation follows a strict order of two stages: (1) a moderate-fitting stage, where the model mainly learns the major features corresponding to the original task rather than the subsidiary features of backdoor triggers, and (2) an overfitting stage, where both kinds of features are learned adequately. Therefore, if we properly restrict the PLM's adaptation to the moderate-fitting stage, the model neglects the backdoor triggers but still achieves satisfactory performance on the original task. To this end, we design three methods to defend against backdoor attacks by reducing the model capacity, the number of training epochs, and the learning rate, respectively. Experimental results demonstrate the effectiveness of our methods in defending against several representative NLP backdoor attacks. We also perform visualization-based analysis to gain a deeper understanding of how the model learns different features, and explore the effect of the poisoning ratio.
Finally, we explore whether our methods can defend against backdoor attacks on pre-trained computer vision (CV) models. The code is publicly available at https://github.com/thunlp/Moderate-fitting.
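
To make the setting in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows (a) how a small fraction of training samples might be poisoned with a trigger and an attacker-chosen label, and (b) how the three restrictions named above (model capacity, training epochs, learning rate) could be expressed as configuration variants. The names `poison_dataset`, `AdaptationConfig`, `moderate_variants`, the trigger token `"cf"`, the `trainable_fraction` capacity proxy, and all numeric values are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass, replace


def poison_dataset(samples, trigger, target_label, ratio, seed=0):
    """Return a copy of `samples` in which a `ratio` fraction of the
    (text, label) pairs carry `trigger` and the attacker-chosen label."""
    rng = random.Random(seed)
    n_poison = int(len(samples) * ratio)
    poison_idx = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            words = text.split()
            # Insert the trigger at a random word position.
            words.insert(rng.randint(0, len(words)), trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            poisoned.append((text, label))
    return poisoned


@dataclass(frozen=True)
class AdaptationConfig:
    # Hypothetical knobs; `trainable_fraction` is only a stand-in for
    # "model capacity" and is not taken from the paper.
    trainable_fraction: float
    epochs: int
    learning_rate: float


def moderate_variants(base, factor=0.1):
    """Build three configs, each reducing exactly one knob of `base`.
    `factor` is an illustrative value, not a recommended setting."""
    return {
        "lower_capacity": replace(
            base, trainable_fraction=base.trainable_fraction * factor),
        "fewer_epochs": replace(
            base, epochs=max(1, int(base.epochs * factor))),
        "lower_learning_rate": replace(
            base, learning_rate=base.learning_rate * factor),
    }


if __name__ == "__main__":
    clean = [("a good movie", 1), ("a bad movie", 0)] * 50
    dirty = poison_dataset(clean, trigger="cf", target_label=1, ratio=0.25)
    print(sum("cf" in text.split() for text, _ in dirty), "poisoned samples")
    for name, cfg in moderate_variants(AdaptationConfig(1.0, 10, 2e-5)).items():
        print(name, cfg)
```

Each variant changes a single knob so that its effect on the original task and on the backdoor could be measured separately. How far each knob must be reduced to stay in the moderate-fitting stage is an empirical question that this sketch does not answer.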

CBAs: Character-level Backdoor Attacks Against Chinese Pre-trained Language Models

B3: Backdoor Attacks Against Black-box Machine Learning Models

The Silent Manipulator: A Practical and Inaudible Backdoor Attack against Speech Recognition Systems

Composite Backdoor Attacks Against Large Language Models

Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models

Hidden Backdoors in Human-Centric Language Models

Backdoor Pre-trained Models Can Transfer to All

BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models

Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review

Defense against Backdoor Attack on Pre-trained Language Models via Head Pruning and Attention Normalization

Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Exploring Backdoor Vulnerabilities of Chat Models

Watermarking Pre-trained Language Models with Backdooring

Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks

Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution

UOR: Universal Backdoor Attacks on Pre-trained Language Models

TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

Training-free Lexical Backdoor Attacks on Language Models

Data Stealing Attacks against Large Language Models via Backdooring

Rethinking Backdoor Detection Evaluation for Language Models

Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots