The triggers that open the NLP model backdoors are hidden in the adversarial samples

Kun Shao, Yu Zhang, Junan Yang, Xiaoshuai Li, Hui Liu
DOI: https://doi.org/10.1016/j.cose.2022.102730
2022-07-01
Abstract: Deep neural networks (DNNs) have been proven to be vulnerable to adversarial attacks. However, adversarial perturbations are generated for specific input samples, and the perturbation of one sample cannot be applied to other samples. In this paper, we propose a method to search for the backdoor of a natural language processing (NLP) model under the black-box condition, and we find that universal attack triggers exist in the adversarial samples. The method consists of two steps. The first step extracts aggressive words from the adversarial samples to form an adversarial knowledge base under the black-box condition. The second step generates universal attack triggers by minimizing the target prediction results over a batch of samples. When the generated trigger is added to any benign input, the prediction accuracy of the DNN model can be reduced to close to zero. The experimental results show that our method achieves a high attack success rate with a short trigger (e.g., more than 90% using a trigger of only length 3 when attacking BiLSTM on SST-2). In addition, experiments show that our method has higher transferability. Finally, to address the backdoor vulnerabilities in NLP models, we conducted two defense experiments, abnormal word detection and word frequency analysis, which improve the NLP model’s ability to resist backdoor attacks.
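The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the rule that an "aggressive word" is any token appearing in an adversarial sample but not in its original, the greedy position-by-position search, and the toy classifier are all assumptions made for the example. The only access to the model is through a `predict_proba` callable, which mimics the black-box condition.

```python
import math
from collections import Counter


def build_knowledge_base(pairs, top_k=50):
    """Step 1 (assumed rule): collect words that appear in an adversarial
    sample but not in its original, ranked by how often they occur."""
    counts = Counter()
    for original, adversarial in pairs:
        original_set = set(original)
        counts.update(tok for tok in adversarial if tok not in original_set)
    return [tok for tok, _ in counts.most_common(top_k)]


def batch_score(predict_proba, batch, labels, trigger):
    """Mean probability the model assigns to the true label when the
    trigger is prepended to every sample (lower means a stronger trigger)."""
    total = 0.0
    for tokens, label in zip(batch, labels):
        total += predict_proba(list(trigger) + list(tokens))[label]
    return total / len(batch)


def search_trigger(predict_proba, batch, labels, knowledge_base, length=3):
    """Step 2 (assumed greedy search): grow the trigger one word at a time,
    always picking the knowledge-base word that minimizes the batch score."""
    trigger = []
    for _ in range(length):
        best = min(
            knowledge_base,
            key=lambda w: batch_score(predict_proba, batch, labels, trigger + [w]),
        )
        trigger.append(best)
    return trigger


def toy_predict_proba(tokens):
    """Hypothetical stand-in for a black-box sentiment model.
    Returns [P(negative), P(positive)]."""
    good, bad = {"great", "fun"}, {"awful", "dull"}
    score = sum(t in good for t in tokens) - sum(t in bad for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]


# Toy (original, adversarial) pairs and a small batch of positive samples.
pairs = [
    (["a", "great", "film"], ["a", "awful", "film"]),
    (["fun", "ride"], ["dull", "ride"]),
    (["great", "fun"], ["awful", "fun"]),
]
knowledge_base = build_knowledge_base(pairs)
batch = [["great", "film"], ["fun", "ride"]]
labels = [1, 1]
trigger = search_trigger(toy_predict_proba, batch, labels, knowledge_base, length=3)
```

Restricting the search to the knowledge base is what keeps the number of model queries small in this sketch: each trigger position costs one batch evaluation per candidate word rather than one per vocabulary entry.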
computer science, information systems