Fast Adversarial Training against Textual Adversarial Attacks

Yichen Yang, Xin Liu, Kun He
2024-01-23
Abstract: Many adversarial defense methods have been proposed to enhance the adversarial robustness of natural language processing models. However, most of them introduce additional pre-set linguistic knowledge and assume that the synonym candidates used by attackers are accessible, which is an ideal assumption. We delve into adversarial training in the embedding space and propose a Fast Adversarial Training (FAT) method to improve the model robustness in the synonym-unaware scenario from the perspective of single-step perturbation generation and perturbation initialization. Based on the observation that the adversarial perturbations crafted by single-step and multi-step gradient ascent are similar, FAT uses single-step gradient ascent to craft adversarial examples in the embedding space to expedite the training process. Based on the observation that the perturbations generated on the identical training sample in successive epochs are similar, FAT fully utilizes historical information when initializing the perturbation. Extensive experiments demonstrate that FAT significantly boosts the robustness of BERT models in the synonym-unaware scenario, and outperforms the defense baselines under various attacks with character-level and word-level modifications.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the efficiency and effectiveness of adversarial training (AT) for natural language processing (NLP) models in the synonym-unaware scenario, where the defender does not know which synonym candidates an attacker will use. Most existing defense methods assume this knowledge in advance, which is impractical.
The paper proposes Fast Adversarial Training (FAT), which operates in the embedding space and rests on two observations. First, for NLP models, the adversarial perturbations generated by single-step and multi-step gradient ascent are similar, so FAT crafts perturbations with a single gradient-ascent step to speed up training. Second, the perturbation direction for the same training sample stays similar across consecutive training epochs, so FAT uses the perturbation from earlier epochs when initializing the perturbation for the current one.
The paper points out that traditional adversarial training methods such as PGD-AT are inefficient for large pre-trained models like BERT, because they need multiple iterations to generate each adversarial example. By reducing the number of iterations, FAT fits more training epochs into the same time budget, which in turn improves the model's robustness. Experimental results show that FAT significantly improves the robustness of BERT against attacks with different levels of access to the model and different perturbation granularities (character-level and word-level), outperforming various defense baselines. The paper also proposes a variant, FAT-I, that further accelerates training with minimal loss in robustness.
In summary, the main contributions of the paper are:
1. Introducing the FAT method, which accelerates adversarial training through single-step gradient ascent and improves robustness by using historical perturbation information for initialization.
2. Introducing FAT-I, which increases training efficiency by reducing how often the perturbation is updated for each training sample.
3. Providing an easy-to-apply and effective adversarial defense for the realistic synonym-unaware scenario, achieving the best robustness among the compared defenses against various advanced attacks.
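The two ideas summarized above, a single gradient-ascent step and initialization from the previous epoch's perturbation, can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: a linear logistic scorer stands in for BERT, a plain list of floats stands in for the token embeddings, and the L2-normalized step, the L2-ball projection, the hyperparameters `eps`, `alpha`, and `lr`, and all function names are hypothetical choices made for the example. The `update_every` parameter is one possible reading of FAT-I's reduced update frequency and is likewise an assumption.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_norm(v):
    return math.sqrt(dot(v, v))

def project_l2(delta, eps):
    """Rescale delta onto the L2 ball of radius eps if it lies outside it."""
    n = l2_norm(delta)
    if n <= eps:
        return list(delta)
    return [eps * d / n for d in delta]

def loss_and_grads(w, emb, delta, y):
    """Logistic loss of a linear scorer on emb + delta, for a label y in {-1, +1}.

    Returns the loss and its gradients with respect to delta and to w.
    """
    x = [e + d for e, d in zip(emb, delta)]
    margin = y * dot(w, x)
    loss = math.log1p(math.exp(-margin))
    coeff = -y / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. the score
    return loss, [coeff * wi for wi in w], [coeff * xi for xi in x]

def single_step_perturbation(w, emb, y, delta_init, eps, alpha):
    """One normalized gradient-ascent step from delta_init, then projection.

    Assumption: L2-normalized step of size alpha and an L2 ball of radius eps.
    """
    _, grad_delta, _ = loss_and_grads(w, emb, delta_init, y)
    n = l2_norm(grad_delta)
    if n == 0.0:  # no ascent direction, e.g. when all weights are zero
        return project_l2(delta_init, eps)
    stepped = [d + alpha * g / n for d, g in zip(delta_init, grad_delta)]
    return project_l2(stepped, eps)

def train(data, dim, epochs, eps=0.5, alpha=0.3, lr=0.1, update_every=1):
    """Adversarial training with single-step perturbations and historical initialization.

    update_every > 1 reuses the stored perturbation in the skipped epochs
    (a hypothetical stand-in for FAT-I's reduced update frequency).
    """
    w = [0.0] * dim
    history = {}  # sample index -> perturbation from its most recent update
    for epoch in range(epochs):
        for i, (emb, y) in enumerate(data):
            if i not in history or epoch % update_every == 0:
                # Start from the perturbation stored for this sample, if any.
                delta_init = history.get(i, [0.0] * dim)
                history[i] = single_step_perturbation(w, emb, y, delta_init, eps, alpha)
            # Gradient-descent step on the model weights, using the perturbed embedding.
            _, _, grad_w = loss_and_grads(w, emb, history[i], y)
            w = [wi - lr * g for wi, g in zip(w, grad_w)]
    return w, history

# Toy usage: four 2-dimensional "embeddings" with labels in {-1, +1}.
toy_data = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-2.0, -1.0], -1), ([-1.0, -2.5], -1)]
weights, stored = train(toy_data, dim=2, epochs=20)
clean_correct = sum(1 for emb, y in toy_data if y * dot(weights, emb) > 0)
print("clean correct:", clean_correct, "of", len(toy_data))
```

In this sketch each training step needs one gradient computation for the perturbation and one for the weights, whereas a multi-step scheme such as PGD-AT would repeat the perturbation step several times per sample. Setting `update_every=2` skips the perturbation gradient in every other epoch, trading some perturbation freshness for speed.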