Abstract: Although pre-trained language models (PrLMs) have achieved significant success, recent studies demonstrate that they are vulnerable to adversarial attacks. By generating adversarial examples with slight perturbations at different levels (sentence / word / character), adversarial attacks can fool PrLMs into making incorrect predictions, which calls the robustness of PrLMs into question. However, we find that most existing textual adversarial examples are unnatural and can be easily distinguished by both humans and machines. Based on a general anomaly detector, we propose a novel metric (Degree of Anomaly) as a constraint that enables current adversarial attack approaches to generate more natural and imperceptible adversarial examples. Under this new constraint, the success rate of existing attacks drastically decreases, which reveals that PrLMs are not as fragile as previously claimed. In addition, we find that four types of randomization can invalidate a large portion of textual adversarial examples. Based on the anomaly detector and randomization, we design a universal defense framework, which is among the first to perform textual adversarial defense without knowledge of the specific attack. Empirical results show that our universal defense framework achieves after-attack accuracy comparable to or even higher than that of attack-specific defenses, while preserving higher original accuracy. Our work discloses the essence of textual adversarial attacks and indicates that (1) future work on adversarial attacks should focus more on how to evade detection and resist randomization, otherwise the adversarial examples would be easily detected and invalidated; and (2) compared with unnatural and perceptible adversarial examples, it is the undetectable adversarial examples that pose real risks for PrLMs and require more attention in future robustness-enhancing strategies.
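
The combination of an anomaly detector with randomization described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the four randomization types, so random token dropping stands in for them, and `toy_anomaly_score`, `toy_classify`, the `threshold`, and the majority vote are hypothetical placeholders rather than the authors' detector, model, or procedure.

```python
import random
from collections import Counter
from typing import Callable, List

Tokens = List[str]

def random_drop(tokens: Tokens, rng: random.Random, p: float = 0.2) -> Tokens:
    """Stand-in randomization (assumed): drop each token with probability p."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or tokens  # never hand the classifier an empty input

def defend(
    tokens: Tokens,
    classify: Callable[[Tokens], str],
    anomaly_score: Callable[[Tokens], float],
    threshold: float = 0.5,   # assumed cut-off, not taken from the paper
    n_samples: int = 9,
    seed: int = 0,
) -> str:
    """Classify directly if the input looks natural; otherwise randomize and vote."""
    if anomaly_score(tokens) < threshold:
        return classify(tokens)
    rng = random.Random(seed)
    votes = Counter(classify(random_drop(tokens, rng)) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# Toy components, purely hypothetical, so the sketch runs end to end.
VOCAB = {"the", "movie", "was", "great", "awful", "good", "bad"}
POSITIVE = {"great", "good"}
NEGATIVE = {"awful", "bad"}

def toy_anomaly_score(tokens: Tokens) -> float:
    """Fraction of out-of-vocabulary tokens, a crude proxy for 'unnatural' text."""
    return sum(t not in VOCAB for t in tokens) / len(tokens)

def toy_classify(tokens: Tokens) -> str:
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return "pos" if pos > neg else "neg"

clean = ["the", "movie", "was", "great"]
perturbed = ["the", "m0vie", "w4s", "gr8", "awful"]
clean_label = defend(clean, toy_classify, toy_anomaly_score)
perturbed_label = defend(perturbed, toy_classify, toy_anomaly_score)
```

In this sketch only inputs flagged as anomalous pay the cost of repeated randomized inference, which is one plausible way to keep accuracy on clean inputs unchanged; whether the actual framework routes inputs this way is not stated in the abstract.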

A Lightweight Chinese Multimodal Textual Defense Method Based on Contrastive-Adversarial Training

WordChange: Adversarial Examples Generation Approach for Chinese Text Classification

TextDefense: Adversarial Text Detection based on Word Importance Entropy

WordRevert: Adversarial Examples Defence Method for Chinese Text Classification

GPSAttack: A Unified Glyphs, Phonetics and Semantics Multi-Modal Attack against Chinese Text Classification Models

TEXTSHIELD: Robust Text Classification Based on Multimodal Embedding and Neural Machine Translation

Searching for an Effective Defender: Benchmarking Defense Against Adversarial Word Substitution

Defensive Dual Masking for Robust Adversarial Defense

Multi-Granularity Tibetan Textual Adversarial Attack Method Based on Masked Language Model

Self-Supervised Contrastive Learning with Adversarial Perturbations for Defending Word Substitution-based Attacks

Rethinking Textual Adversarial Defense for Pre-trained Language Models

An adversarial-example generation method for Chinese sentiment tendency classification based on audiovisual confusion and contextual association

Textual Adversarial Attack As Combinatorial Optimization

Mutual-modality Adversarial Attack with Semantic Perturbation

Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting

WordIllusion: An Adversarial Text Generation Algorithm Based on Human Cognitive System

Certified Robustness to Text Adversarial Attacks by Randomized [MASK]

A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models

Pay Attention to the Robustness of Chinese Minority Language Models! Syllable-level Textual Adversarial Attack on Tibetan Script