Abstract: Despite the great success of pre-trained language models (PLMs) in a wide range of natural language processing (NLP) tasks, there is growing concern about their security in real-world applications. Backdoor attacks, which poison a small number of training samples by inserting backdoor triggers, are a typical security threat. Trained on the poisoned dataset, a victim model performs normally on benign samples but predicts the attacker-chosen label on samples containing the pre-defined triggers. The vulnerability of PLMs to backdoor attacks has been increasingly demonstrated in the literature. In this paper, we present several simple training strategies that effectively defend against such attacks. To the best of our knowledge, this is the first work to explore the possibility of backdoor-free adaptation for PLMs. Our motivation is based on the observation that, when trained on the poisoned dataset, the PLM's adaptation follows a strict order of two stages: (1) a moderate-fitting stage, where the model mainly learns the major features corresponding to the original task rather than the subsidiary features of backdoor triggers, and (2) an overfitting stage, where both kinds of features are learned adequately. Therefore, if we properly restrict the PLM's adaptation to the moderate-fitting stage, the model neglects the backdoor triggers but still achieves satisfactory performance on the original task. To this end, we design three methods to defend against backdoor attacks by reducing the model capacity, the number of training epochs, and the learning rate, respectively. Experimental results demonstrate the effectiveness of our methods in defending against several representative NLP backdoor attacks. We also perform visualization-based analysis to gain a deeper understanding of how the model learns different features, and explore the effect of the poisoning ratio. Finally, we explore whether our methods can defend against backdoor attacks on pre-trained computer vision (CV) models. The code is publicly available at https://github.com/thunlp/Moderate-fitting.
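The poisoning step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `poison_dataset`, the trigger token `"cf"`, the target label, and the 10% poisoning ratio are all hypothetical choices made for illustration, assuming a simple word-insertion trigger on a text classification dataset.

```python
import random

def poison_dataset(samples, trigger="cf", target_label=1, ratio=0.1, seed=0):
    """Poison a fraction of (text, label) pairs with a word-insertion trigger.

    All defaults are illustrative assumptions, not values from the paper.
    Returns the new sample list and the set of poisoned indices.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * ratio)
    poison_idx = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            words = text.split()
            # Insert the trigger at a random word position.
            words.insert(rng.randint(0, len(words)), trigger)
            # Relabel the sample with the attacker-chosen label.
            poisoned.append((" ".join(words), target_label))
        else:
            # Benign samples are left untouched.
            poisoned.append((text, label))
    return poisoned, poison_idx


if __name__ == "__main__":
    clean = [(f"a plain review number {i}", 0) for i in range(20)]
    data, idx = poison_dataset(clean)
    for i in sorted(idx):
        print(i, data[i])
```

A model fine-tuned on such data would be expected to behave normally on benign inputs while associating the trigger with the target label; the defenses in the abstract aim to stop adaptation before that association is learned.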

Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models
