Abstract: Many adversarial defense methods have been proposed to enhance the adversarial robustness of natural language processing models. However, most of them introduce additional pre-set linguistic knowledge and assume that the synonym candidates used by attackers are accessible, which is an ideal assumption. We delve into adversarial training in the embedding space and propose a Fast Adversarial Training (FAT) method to improve the model robustness in the synonym-unaware scenario from the perspective of single-step perturbation generation and perturbation initialization. Based on the observation that the adversarial perturbations crafted by single-step and multi-step gradient ascent are similar, FAT uses single-step gradient ascent to craft adversarial examples in the embedding space to expedite the training process. Based on the observation that the perturbations generated on the identical training sample in successive epochs are similar, FAT fully utilizes historical information when initializing the perturbation. Extensive experiments demonstrate that FAT significantly boosts the robustness of BERT models in the synonym-unaware scenario, and outperforms the defense baselines under various attacks with character-level and word-level modifications.

What problem does this paper attempt to address?

This paper addresses the efficiency and effectiveness of adversarial training (AT) for natural language processing (NLP) models, particularly in the synonym-unaware scenario, where the defender does not know which synonym candidates an attacker will use. Most existing defenses assume this knowledge is available, which is unrealistic in practice.

The paper proposes Fast Adversarial Training (FAT), which operates in the embedding space and rests on two observations. First, for NLP models, the adversarial perturbations produced by single-step and multi-step gradient ascent are similar, so FAT uses single-step gradient ascent to speed up training. Second, the perturbation direction for the same training sample stays similar across consecutive training epochs, so FAT uses this historical information to initialize the perturbation.

Conventional adversarial training methods such as PGD-AT are inefficient for large pre-trained models like BERT because they need multiple gradient iterations to generate each adversarial example. By cutting the number of iterations, FAT fits more training epochs into the same time budget, which improves robustness. Experiments show that FAT significantly improves the robustness of BERT against attacks that differ in how much of the model the attacker can see and in the granularity of their modifications (character level and word level), outperforming the defense baselines. The paper also proposes a variant, FAT-I, that further accelerates training with only a small loss in robustness.

In summary, the main contributions of the paper are:

1. Introducing FAT, which accelerates adversarial training through single-step gradient ascent and improves robustness by initializing perturbations from historical information.
2. Introducing FAT-I, which increases training efficiency by updating the perturbation for each training sample less frequently.
3. Providing an easy-to-apply and effective adversarial defense for the realistic synonym-unaware scenario, with the best robustness among the compared defenses against a range of advanced attacks.
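
The two ideas behind FAT (a single gradient-ascent step in the embedding space, and initializing each sample's perturbation from the one stored in the previous epoch) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a toy logistic-regression classifier over fixed 2-dimensional "embeddings" in place of BERT, an L2-norm ball for the perturbation, and a normalized-gradient step. The names `train_fat`, `single_step_perturbation`, `TOY_DATA`, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def grads(w, b, e, y):
    """Gradients of the logistic loss for one embedded sample e with label y in {0, 1}.

    Returns (dL/dw, dL/db, dL/de).
    """
    p = sigmoid(sum(wi * ei for wi, ei in zip(w, e)) + b)
    g = p - y  # derivative of the loss w.r.t. the logit
    return [g * ei for ei in e], g, [g * wi for wi in w]


def project_l2(delta, radius):
    # Assumption: perturbations are constrained to an L2 ball of the given radius.
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return delta
    return [d * radius / norm for d in delta]


def single_step_perturbation(w, b, e, y, delta_init, step_size, radius):
    """One gradient-ascent step on the loss, starting from delta_init."""
    e_adv = [ei + di for ei, di in zip(e, delta_init)]
    _, _, grad_e = grads(w, b, e_adv, y)
    norm = math.sqrt(sum(g * g for g in grad_e)) or 1.0  # avoid dividing by zero
    delta = [di + step_size * gi / norm for di, gi in zip(delta_init, grad_e)]
    return project_l2(delta, radius)


def train_fat(data, dim, epochs=20, lr=0.1, step_size=0.3, radius=0.5, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    b = 0.0
    # Historical information: one stored perturbation per training sample.
    history = [[0.0] * dim for _ in data]
    for _ in range(epochs):
        for i, (e, y) in enumerate(data):
            # Initialize from the previous epoch's perturbation, then take one step.
            delta = single_step_perturbation(w, b, e, y, history[i], step_size, radius)
            history[i] = delta
            # Ordinary gradient-descent update on the perturbed embedding.
            e_adv = [ei + di for ei, di in zip(e, delta)]
            gw, gb, _ = grads(w, b, e_adv, y)
            w = [wi - lr * gwi for wi, gwi in zip(w, gw)]
            b -= lr * gb
    return w, b, history


def predict(w, b, e):
    return 1 if sum(wi * ei for wi, ei in zip(w, e)) + b > 0 else 0


# Hypothetical toy data: two well-separated classes in a 2-d "embedding" space.
TOY_DATA = [([2.0, 1.5], 1), ([1.5, 2.0], 1), ([-2.0, -1.5], 0), ([-1.5, -2.0], 0)]
w, b, history = train_fat(TOY_DATA, dim=2)
print("weights:", w, "bias:", b)
```

Each training sample costs one extra gradient computation per epoch to build its perturbation, whereas a multi-step method such as PGD-AT would repeat the inner step several times. In a real setting the perturbation would be added to the model's token embeddings, not to a fixed vector.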

Fast Adversarial Training against Textual Adversarial Attacks
