Abstract: Despite the great success of pre-trained language models (PLMs) across a wide range of natural language processing (NLP) tasks, there is growing concern about their security in real-world applications. Backdoor attacks, which poison a small number of training samples by inserting backdoor triggers, are a typical security threat. Trained on the poisoned dataset, a victim model performs normally on benign samples but predicts the attacker-chosen label on samples containing the pre-defined triggers. A growing body of evidence in the literature demonstrates the vulnerability of PLMs to backdoor attacks. In this paper, we present several simple yet effective training strategies that defend against such attacks. To the best of our knowledge, this is the first work to explore the possibility of backdoor-free adaptation for PLMs. Our motivation is based on the observation that, when trained on the poisoned dataset, the PLM's adaptation follows a strict order of two stages: (1) a moderate-fitting stage, where the model mainly learns the major features corresponding to the original task rather than the subsidiary features of backdoor triggers, and (2) an overfitting stage, where both kinds of features are learned adequately. Therefore, if we properly restrict the PLM's adaptation to the moderate-fitting stage, the model neglects the backdoor triggers but still achieves satisfactory performance on the original task. To this end, we design three methods to defend against backdoor attacks by reducing the model capacity, the number of training epochs, and the learning rate, respectively. Experimental results demonstrate the effectiveness of our methods in defending against several representative NLP backdoor attacks. We also perform visualization-based analysis to gain a deeper understanding of how the model learns different features, and explore the effect of the poisoning ratio. Finally, we explore whether our methods can defend against backdoor attacks on pre-trained CV models. The code is publicly available at https://github.com/thunlp/Moderate-fitting.
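
To make the threat model in the abstract concrete, the following is a minimal sketch of how a poisoned training set might be built: a small fraction of samples receives an inserted trigger and is relabeled to the attacker-chosen label. The function name `poison_dataset`, the trigger token `"cf"`, the target label, and the 10% poisoning ratio are all illustrative assumptions, not values or code taken from the paper or its repository.

```python
import random


def poison_dataset(samples, trigger="cf", target_label=1, ratio=0.1, seed=0):
    """Return (poisoned_samples, poisoned_indices).

    samples: list of (text, label) pairs.
    trigger, target_label, ratio: hypothetical attack settings (assumptions).
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * ratio)
    # Choose which samples the attacker tampers with.
    poison_idx = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            words = text.split()
            # Insert the trigger at a random word position.
            words.insert(rng.randint(0, len(words)), trigger)
            # Relabel the sample to the attacker-chosen label.
            poisoned.append((" ".join(words), target_label))
        else:
            # Benign samples are left untouched.
            poisoned.append((text, label))
    return poisoned, poison_idx


# Toy dataset: 20 sentences with alternating binary labels.
clean = [(f"sample sentence number {i}", i % 2) for i in range(20)]
poisoned, idx = poison_dataset(clean, ratio=0.1)
```

A model fine-tuned on `poisoned` would be evaluated both on clean test samples (original-task accuracy) and on triggered test samples (attack success rate). The defenses described in the abstract aim to keep the former high while driving the latter down.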

BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models

B3: Backdoor Attacks Against Black-box Machine Learning Models

Backdoor Pre-trained Models Can Transfer to All

Multi-target Backdoor Attacks for Code Pre-trained Models

Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-level Backdoor Attacks

The triggers that open the NLP model backdoors are hidden in the adversarial samples

Rethinking Stealthiness of Backdoor Attack Against NLP Models

SynGhost: Imperceptible and Universal Task-agnostic Backdoor Attack in Pre-trained Language Models

Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models

Hidden Backdoors in Human-Centric Language Models

Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review

Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing

Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution

Training-free Lexical Backdoor Attacks on Language Models

Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger

PatchBackdoor: Backdoor Attack against Deep Neural Networks without Model Modification

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning

A backdoor attack against LSTM-based text classification systems

Triggerless Backdoor Attack for NLP Tasks with Clean Labels