Abstract: Although pre-trained language models (PrLMs) have achieved significant success, recent studies demonstrate that PrLMs are vulnerable to adversarial attacks. By generating adversarial examples with slight perturbations at different levels (sentence / word / character), adversarial attacks can fool PrLMs into generating incorrect predictions, which calls the robustness of PrLMs into question. However, we find that most existing textual adversarial examples are unnatural and can be easily distinguished by both humans and machines. Based on a general anomaly detector, we propose a novel metric (Degree of Anomaly) as a constraint that enables current adversarial attack approaches to generate more natural and imperceptible adversarial examples. Under this new constraint, the success rate of existing attacks drastically decreases, which reveals that PrLMs are not as fragile as previously claimed. In addition, we find that four types of randomization can invalidate a large portion of textual adversarial examples. Based on the anomaly detector and randomization, we design a universal defense framework, which is among the first to perform textual adversarial defense without knowing the specific attack. Empirical results show that our universal defense framework achieves after-attack accuracy comparable to or even higher than that of other attack-specific defenses, while preserving higher original accuracy. Our work discloses the essence of textual adversarial attacks and indicates that (1) future work on adversarial attacks should focus more on how to evade detection and resist randomization, otherwise the adversarial examples would be easily detected and invalidated; and (2) compared with unnatural and perceptible adversarial examples, it is the undetectable adversarial examples that pose real risks for PrLMs and require more attention in future robustness-enhancing strategies.
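The abstract names the two ingredients of the defense (an anomaly detector and randomization) but not how they are wired together. The sketch below shows one plausible combination, not the authors' implementation: inputs whose anomaly score reaches a threshold are randomized several times and classified by majority vote. Every name here (`defend`, `randomly_drop_word`, `threshold`, `n_samples`) is hypothetical, the single word-deletion randomization is a stand-in for the four types the abstract mentions without specifying, and the majority vote is an assumption.

```python
import random
from collections import Counter
from typing import Callable


def randomly_drop_word(text: str, rng: random.Random) -> str:
    """Hypothetical randomization: delete one randomly chosen word."""
    words = text.split()
    if len(words) < 2:
        return text  # nothing sensible to drop
    del words[rng.randrange(len(words))]
    return " ".join(words)


def defend(
    text: str,
    classify: Callable[[str], int],
    anomaly_score: Callable[[str], float],
    randomize: Callable[[str, random.Random], str] = randomly_drop_word,
    threshold: float = 0.5,
    n_samples: int = 5,
    seed: int = 0,
) -> int:
    """Classify `text`, randomizing it first if the detector flags it.

    Assumption: `anomaly_score` returns a higher value for less natural text,
    and flagged inputs are resolved by majority vote over randomized copies.
    """
    if anomaly_score(text) < threshold:
        # Input looks natural: use the model's prediction unchanged,
        # which leaves accuracy on clean inputs untouched.
        return classify(text)
    rng = random.Random(seed)
    votes = Counter(classify(randomize(text, rng)) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

Because the defense never inspects how the perturbation was produced, the same wrapper applies to any attack, which is the sense in which the abstract calls the framework universal. In practice `classify` and `anomaly_score` would wrap a fine-tuned PrLM and a trained detector; both are left as caller-supplied functions here.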

Self-Supervised Contrastive Learning with Adversarial Perturbations for Defending Word Substitution-based Attacks

BERT-ATTACK: Adversarial Attack Against BERT Using BERT

SCAT: Robust Self-supervised Contrastive Learning via Adversarial Training for Text Classification

BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks

Towards Improving Adversarial Training of NLP Models

SemRoDe: Macro Adversarial Training to Learn Representations That are Robust to Word-Level Attacks

TextCheater: A Query-Efficient Textual Adversarial Attack in the Hard-Label Setting

Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning

Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation

Towards Evaluating the Robustness of Chinese BERT Classifiers

Better Robustness by More Coverage: Adversarial and Mixup Data Augmentation for Robust Finetuning

Certified Robustness to Adversarial Word Substitutions

ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning

Rethinking Textual Adversarial Defense for Pre-trained Language Models

How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?

LexicalAT: Lexical-Based Adversarial Reinforcement Training for Robust Sentiment Classification

Modeling Adversarial Attack on Pre-trained Language Models As Sequential Decision Making

Robustifying Language Models with Test-Time Adaptation

Enhancing Adversarial Text Attacks on BERT Models with Projected Gradient Descent

Textual Adversarial Attack As Combinatorial Optimization