Abstract: Language models, especially basic text classification models, have been shown to be susceptible to textual adversarial attacks such as synonym substitution and word insertion. To defend against such attacks, a growing body of research has been devoted to improving model robustness. However, providing provable robustness guarantees, rather than empirical robustness, remains largely unexplored. In this paper, we propose Text-CRS, a generalized certified robustness framework for natural language processing (NLP) based on randomized smoothing. To the best of our knowledge, existing certified schemes for NLP can only certify robustness against $\ell_0$ perturbations in synonym substitution attacks. Representing each word-level adversarial operation (i.e., synonym substitution, word reordering, insertion, and deletion) as a combination of permutation and embedding transformation, we propose novel smoothing theorems to derive robustness bounds in both the permutation and embedding spaces against such adversarial operations. To further improve certified accuracy and radius, we consider the numerical relationships between discrete words and select proper noise distributions for the randomized smoothing. Finally, we conduct extensive experiments on multiple language models and datasets. Text-CRS can address all four word-level adversarial operations and achieves a significant accuracy improvement. We also provide the first benchmark on certified accuracy and radius for the four word-level operations, in addition to outperforming the state-of-the-art certification against synonym substitution attacks.
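The idea of describing a word-level operation as a permutation plus an embedding transformation can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the paper's implementation: the helper `apply_permutation`, the example sentence, and all embedding values are hypothetical, and only word reordering and synonym substitution are shown.

```python
def apply_permutation(perm, rows):
    """Reorder rows so that output position i holds rows[perm[i]]."""
    return [rows[j] for j in perm]


# Hypothetical 2-dimensional embeddings for a 4-token sentence (made-up values).
tokens = ["the", "movie", "was", "great"]
embeddings = [[0.1, 0.0], [0.7, 0.2], [0.0, 0.3], [0.9, 0.8]]

# Word reordering: a non-identity permutation, embedding rows left unchanged.
reorder = [0, 3, 2, 1]  # swaps positions 1 and 3
reordered_tokens = apply_permutation(reorder, tokens)
reordered_embeddings = apply_permutation(reorder, embeddings)

# Synonym substitution: identity permutation, one embedding row is replaced.
synonym_vec = [0.85, 0.75]  # hypothetical embedding of a synonym of "great"
substituted = [row[:] for row in embeddings]  # copy so the original is untouched
substituted[3] = synonym_vec

print(reordered_tokens)
print(substituted[3])
```

In this view, reordering perturbs only the permutation while substitution perturbs only the embedding rows, which is why the two spaces can be treated separately.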

What problem does this paper attempt to address?

The main problem this paper addresses is how to make text classification models robust against textual adversarial attacks with provable guarantees, rather than only empirical robustness. Specifically, the paper proposes a general certified robustness framework, Text-CRS, covering four common word-level adversarial operations: synonym substitution, word reordering, word insertion, and word deletion.

### Problem Background

1. **Vulnerability of Text Classification Models**: Existing text classification models (such as deep-learning-based language models) are vulnerable to word-level adversarial attacks such as synonym substitution and word insertion. These attacks make the model output wrong results through minor modifications, and can therefore be exploited maliciously to spread false information or bypass content review.
2. **Limitations of Existing Defense Methods**:
   - Most existing defense methods provide only empirical robustness: they verify model performance under particular attacks experimentally, but give no theoretical guarantee.
   - Most existing certified defense methods handle only specific attack types, for example only $\ell_0$ perturbations in synonym substitution attacks.
   - These methods usually assume that synonyms are uniformly distributed, which is unrealistic in practice and results in low certified accuracy.

### Core Contributions of the Paper

1. **Proposing the Text-CRS Framework**: A general certified robustness framework based on randomized smoothing that handles four common word-level adversarial operations and provides a theoretical robustness guarantee for each.
2. **New Robustness Theorems**: For each word-level adversarial operation, the paper proposes a customized theorem that uses a different noise distribution (staircase-shaped, uniform, Gaussian, or Bernoulli) to model the corresponding attack. These theorems yield the robustness bounds for each operation.
3. **Improved Training Toolkit**: To further improve certified accuracy and certified radius, the paper proposes a set of optimization techniques, including the use of anisotropic Gaussian noise to enlarge the certified radius.
4. **Extensive Experimental Verification**: The paper runs a large number of experiments on multiple datasets (such as AG's News, Amazon, and IMDB) and two NLP models (LSTM and BERT) to verify the effectiveness of Text-CRS. The results show that the average certified accuracy of Text-CRS under five representative adversarial attacks reaches 81.7%, which is 64% higher than that of existing methods.

### Summary

By introducing the Text-CRS framework, the paper addresses the limitations of existing defense methods and provides stronger robustness and higher certified accuracy for text classification models. The framework offers a new benchmark and direction for future research, especially for handling multiple word-level adversarial attacks.
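As background on the randomized smoothing that Text-CRS builds on, the following minimal sketch shows the generic Gaussian case: a smoothed classifier takes a majority vote over noisy copies of an embedding and certifies an $\ell_2$ radius $R = \sigma\,\Phi^{-1}(\underline{p_A})$. This is the standard Gaussian smoothing bound, not the operation-specific theorems of Text-CRS. Everything here is an assumption for illustration: `base_classifier` is a toy stand-in for a real model, the parameter values are arbitrary, and a simple Hoeffding lower bound replaces the Clopper-Pearson interval that is more commonly used.

```python
import math
import random
from statistics import NormalDist


def base_classifier(embedding):
    """Toy stand-in for a text classifier: class 1 if the mean coordinate is positive."""
    return 1 if sum(embedding) / len(embedding) > 0 else 0


def smoothed_predict(embedding, sigma=0.5, n=2000, alpha=0.001, seed=0):
    """Majority vote under Gaussian noise plus a conservative certified L2 radius.

    Returns (label, radius), or (None, 0.0) when the vote is too close to certify.
    """
    rng = random.Random(seed)
    counts = [0, 0]
    for _ in range(n):
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        counts[base_classifier(noisy)] += 1

    top = 0 if counts[0] >= counts[1] else 1
    p_hat = counts[top] / n
    # One-sided Hoeffding lower confidence bound on the top-class probability
    # (holds with probability at least 1 - alpha; looser than Clopper-Pearson).
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return None, 0.0  # abstain: no certificate can be issued

    radius = sigma * NormalDist().inv_cdf(p_lower)
    return top, radius


clear_label, clear_radius = smoothed_predict([1.0] * 8)
ambiguous_label, ambiguous_radius = smoothed_predict([0.0] * 8)
print(clear_label, round(clear_radius, 3))
print(ambiguous_label, round(ambiguous_radius, 3))
```

An input far from the decision boundary receives a positive radius, while an input on the boundary is expected to trigger an abstention or at most a very small radius. Choosing a noise distribution matched to each operation, as the summary describes, changes both the sampling step and the radius formula.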

Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks

Certified Robustness to Text Adversarial Attacks by Randomized [MASK]

CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models

Certified Robustness Against Natural Language Attacks by Causal Intervention

Certified Adversarial Robustness Within Multiple Perturbation Bounds

CERT-ED: Certifiably Robust Text Classification for Edit Distance

Robustness-Aware Word Embedding Improves Certified Robustness to Adversarial Word Substitutions

Certified Robustness to Adversarial Word Substitutions

Towards Bridging the gap between Empirical and Certified Robustness against Adversarial Examples

Rethinking Textual Adversarial Defense for Pre-trained Language Models

Adaptive Randomized Smoothing: Certified Adversarial Robustness for Multi-Step Defences

Certified Robustness to Programmable Transformations in LSTMs

From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework

Regularized Training and Tight Certification for Randomized Smoothed Classifier with Provable Robustness

Textual Adversarial Attack As Combinatorial Optimization

Provably Robust Cost-Sensitive Learning via Randomized Smoothing

CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation

Advancing the Robustness of Large Language Models through Self-Denoised Smoothing

Adversarial Robustification via Text-to-Image Diffusion Models

Evaluating Text Classification Robustness to Part-of-Speech Adversarial Examples

Towards a Robust Deep Neural Network Against Adversarial Texts: A Survey