NLPSweep: A comprehensive defense scheme for mitigating NLP backdoor attacks

Tao Xiang, Fei Ouyang, Di Zhang, Chunlong Xie, Hao Wang
DOI: https://doi.org/10.1016/j.ins.2024.120176
IF: 8.1
2024-01-26
Information Sciences
Abstract: Natural language processing (NLP) backdoor attacks have become a hidden threat to modern NLP applications. Most existing defense methods target specific types of backdoor attacks, and they generally fail to defend against invisible backdoors with syntactically correct triggers. This paper proposes NLPSweep, a comprehensive defense scheme that can defend against five common types of backdoor attacks, namely, character, word, sentence, homograph, and learnable textual attacks. Specifically, we propose a framework that can discover an effective defense solution without prior knowledge of the attacks. The defense solution is optimized within the framework and can defend against various attacks while ensuring high accuracy. Finally, we verify the effectiveness of NLPSweep with two pretrained models (BERT and XLNet) on three classic datasets (SST-2, IMDB, and OLID) and compare it with five state-of-the-art defense methods, namely, ONION, Pred, RAP, Fine-pruning, and STRIP. The experimental results demonstrate that NLPSweep achieves an average model accuracy (ACC) greater than 0.922 while holding the average attack success rate (ASR) to only 0.202, outperforming the compared methods. Furthermore, NLPSweep is tested on the real-world Yelp dataset, where it effectively defends against backdoor attacks with an ASR below 0.07 and an ACC above 0.973.
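The abstract reports results as ACC and ASR without defining them. The sketch below shows how these two metrics are commonly computed in the backdoor literature; it is an illustration under stated assumptions, not the paper's evaluation code. The function names, the integer label encoding, and the assumption that ASR is measured on triggered samples whose true class differs from the attacker's target label are all hypothetical choices made here.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """ACC: fraction of clean (trigger-free) samples classified correctly."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def attack_success_rate(triggered_preds: Sequence[int], target_label: int) -> float:
    """ASR: fraction of triggered samples predicted as the attacker's target label.

    Assumption: every sample passed in originally belonged to a class other
    than target_label, so each hit counts as a successful attack.
    """
    if len(triggered_preds) == 0:
        raise ValueError("triggered_preds must be non-empty")
    hits = sum(p == target_label for p in triggered_preds)
    return hits / len(triggered_preds)


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the paper.
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    print(attack_success_rate([1, 1, 0, 1, 0], target_label=1))
```

Under this reading, an effective defense keeps ACC high on clean inputs while driving ASR toward zero on triggered inputs.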
computer science, information systems