Abstract: Although pre-trained language models (PrLMs) have achieved significant success, recent studies demonstrate that PrLMs are vulnerable to adversarial attacks. By generating adversarial examples with slight perturbations at different levels (sentence / word / character), adversarial attacks can fool PrLMs into making incorrect predictions, which calls the robustness of PrLMs into question. However, we find that most existing textual adversarial examples are unnatural and can be easily distinguished by both humans and machines. Based on a general anomaly detector, we propose a novel metric (Degree of Anomaly) as a constraint that enables current adversarial attack approaches to generate more natural and imperceptible adversarial examples. Under this new constraint, the success rate of existing attacks drastically decreases, which reveals that PrLMs are not as fragile as previously claimed. In addition, we find that four types of randomization can invalidate a large portion of textual adversarial examples. Based on the anomaly detector and randomization, we design a universal defense framework, which is among the first to perform textual adversarial defense without knowing the specific attack. Empirical results show that our universal defense framework achieves after-attack accuracy comparable to or even higher than that of attack-specific defenses, while preserving higher original accuracy at the same time. Our work discloses the essence of textual adversarial attacks and indicates that (1) future work on adversarial attacks should focus more on how to evade detection and resist randomization, since otherwise the adversarial examples would be easily detected and invalidated; and (2) compared with unnatural and perceptible adversarial examples, it is the undetectable adversarial examples that pose real risks for PrLMs and deserve more attention in future robustness-enhancing strategies.
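The idea of combining an anomaly detector with randomization can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify how the two components are combined or which four randomizations are used, so the word-drop randomization, the threshold, the majority vote, and the names `defend`, `random_word_drop`, `classify`, and `anomaly_score` are all hypothetical and should not be read as the paper's method.

```python
import random
from collections import Counter
from typing import Callable


def random_word_drop(text: str, rng: random.Random, p: float = 0.1) -> str:
    """Hypothetical randomization: drop each word independently with probability p."""
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    # Fall back to the original text if every word was dropped.
    return " ".join(kept) if kept else text


def defend(
    text: str,
    classify: Callable[[str], str],         # assumed: any text classifier
    anomaly_score: Callable[[str], float],  # assumed: higher means more anomalous
    threshold: float = 0.5,                 # assumed cut-off, not from the paper
    n_samples: int = 5,
    seed: int = 0,
) -> str:
    """Attack-agnostic sketch: randomize only inputs flagged as anomalous."""
    if anomaly_score(text) < threshold:
        # Input looks natural: predict directly, leaving clean accuracy untouched.
        return classify(text)
    rng = random.Random(seed)
    # Input looks anomalous: classify several randomized copies and majority-vote.
    votes = Counter(
        classify(random_word_drop(text, rng)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]
```

In this sketch, inputs that the detector considers natural bypass randomization entirely, which is one plausible way a defense could preserve original accuracy while still disrupting perturbed inputs.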
