Combating Multi-level Adversarial Text with Pruning Based Adversarial Training
Jianpeng Ke, Lina Wang, Aoshuang Ye, Jie Fu
DOI: https://doi.org/10.1109/ijcnn55064.2022.9892314
2022-01-01
Abstract: Despite significant advances in deep learning-based models for natural language processing (NLP) tasks, previous work has shown that many models, including deep neural networks (DNNs), suffer moderate to significant performance degradation on adversarial examples. An adversary crafts malicious text by adding, deleting, or modifying characters, words, and sentences to fool DNN models. Adversarial training and model-enhancement methods have therefore been proposed to combat adversarial attacks. However, both kinds of methods lack generalization because neural networks are intrinsically prone to overfitting. In this paper, we propose a novel framework for combating adversarial text examples, namely DisPAT, which consists of an adversarial text discriminator and a robust pruned text classifier. First, we explore the distributions of adversarial and benign examples in embedding space, which indicate the feasibility of a DNN-based discriminator. We then deploy a generator to produce multi-level adversarial texts and a discriminator to identify adversarial perturbations. Notably, in the inference stage, our pipeline places the well-trained discriminator in front of the text classifier to distinguish char-level adversarial text. Finally, we apply neuron-salience-based pruning to specifically improve the classifier's performance on adversarial text. Experimental results show that our approach outperforms state-of-the-art baselines in combating both char-level and word-level adversarial text. Moreover, DisPAT achieves accuracy very close to, or even higher than, that of the standard model.
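The inference-stage arrangement described in the abstract, with a discriminator placed in front of the text classifier, might be sketched as follows. This is only an illustration under stated assumptions, not the DisPAT implementation: `make_pipeline`, `Prediction`, the 0.5 default threshold, and both toy models are hypothetical names and values, and the abstract does not say what happens to text that the discriminator flags, so withholding the classifier output is an assumption made here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Prediction:
    flagged: bool          # True if the discriminator judged the text adversarial
    label: Optional[str]   # classifier output; None when the input was flagged


def make_pipeline(
    discriminator: Callable[[str], float],
    classifier: Callable[[str], str],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> Callable[[str], Prediction]:
    """Place a discriminator in front of a classifier (hypothetical sketch)."""

    def predict(text: str) -> Prediction:
        score = discriminator(text)  # higher score = more likely adversarial
        if score >= threshold:
            # Assumption: flagged text is withheld from the classifier.
            return Prediction(flagged=True, label=None)
        return Prediction(flagged=False, label=classifier(text))

    return predict


# Toy stand-ins for the trained models, for illustration only.
def toy_discriminator(text: str) -> float:
    # Fraction of characters that are neither letters nor whitespace,
    # a crude proxy for char-level perturbations such as "g0@d".
    if not text:
        return 0.0
    odd = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return odd / len(text)


def toy_classifier(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"


predict = make_pipeline(toy_discriminator, toy_classifier, threshold=0.1)
clean_result = predict("a good movie")
perturbed_result = predict("a g0@d m0v1e")
```

In this sketch the clean sentence passes through to the classifier, while the character-perturbed one is flagged first. A real system would replace the toy functions with the trained discriminator and the pruned classifier.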