Abstract: Language models, especially basic text classification models, have been shown to be susceptible to textual adversarial attacks such as synonym substitution and word insertion attacks. To defend against such attacks, a growing body of research has been devoted to improving model robustness. However, providing provable robustness guarantees, rather than empirical robustness, remains largely unexplored. In this paper, we propose Text-CRS, a generalized certified robustness framework for natural language processing (NLP) based on randomized smoothing. To the best of our knowledge, existing certified schemes for NLP can only certify robustness against $\ell_0$ perturbations in synonym substitution attacks. Representing each word-level adversarial operation (i.e., synonym substitution, word reordering, insertion, and deletion) as a combination of permutation and embedding transformation, we propose novel smoothing theorems to derive robustness bounds in both permutation and embedding space against such adversarial operations. To further improve certified accuracy and radius, we consider the numerical relationships between discrete words and select proper noise distributions for the randomized smoothing. Finally, we conduct substantial experiments on multiple language models and datasets. Text-CRS can address all four different word-level adversarial operations and achieve a significant accuracy improvement. We also provide the first benchmark on certified accuracy and radius of four word-level operations, besides outperforming the state-of-the-art certification against synonym substitution attacks.
What problem does this paper attempt to address?
The paper aims to make text classification models robust to textual adversarial attacks with provable guarantees, rather than empirical robustness alone. Specifically, it proposes Text-CRS, a general certified robustness framework that covers four common word-level adversarial operations: synonym substitution, word reordering, word insertion, and word deletion.
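To make the four word-level operations concrete, the following is a minimal sketch on a tokenized sentence. The function names and the example sentence are illustrative assumptions and do not come from the paper, which formalizes these operations as permutation and embedding transformations rather than as list edits.

```python
from typing import List


def substitute(tokens: List[str], index: int, synonym: str) -> List[str]:
    """Synonym substitution: replace the word at `index` with `synonym`."""
    out = list(tokens)
    out[index] = synonym
    return out


def reorder(tokens: List[str], permutation: List[int]) -> List[str]:
    """Word reordering: output position k takes the word at permutation[k]."""
    if sorted(permutation) != list(range(len(tokens))):
        raise ValueError("permutation must contain each position exactly once")
    return [tokens[i] for i in permutation]


def insert(tokens: List[str], index: int, word: str) -> List[str]:
    """Word insertion: place `word` before position `index`."""
    return tokens[:index] + [word] + tokens[index:]


def delete(tokens: List[str], index: int) -> List[str]:
    """Word deletion: drop the word at `index`."""
    return tokens[:index] + tokens[index + 1:]


# Hypothetical example sentence; each call leaves the original list unchanged.
sentence = ["the", "movie", "was", "great"]
print(substitute(sentence, 3, "fantastic"))
print(reorder(sentence, [0, 1, 3, 2]))
print(insert(sentence, 3, "really"))
print(delete(sentence, 2))
```

Each operation changes only a few words, which is why such edits can preserve the meaning a human reads while still flipping a classifier's prediction.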
### Problem Background
1. **Vulnerability of Text Classification Models**: Existing text classification models (such as deep-learning-based language models) are vulnerable to word-level adversarial attacks such as synonym substitution and word insertion. These attacks can make the model output wrong results through minor modifications, and can therefore be exploited maliciously to spread false information or bypass content review.
2. **Limitations of Existing Defense Methods**:
- Most existing defense methods provide only empirical robustness: they measure model performance under specific attacks experimentally but offer no theoretical guarantee.
- Most existing certified defense methods handle only specific attack types, for example $\ell_0$ perturbations in synonym substitution attacks.
- These methods usually assume that synonyms are uniformly distributed, which is unrealistic in practice and results in low certified accuracy.
### Core Contributions of the Paper
1. **Proposing the Text-CRS Framework**: A general certified robustness framework based on randomized smoothing that handles four common word-level adversarial operations and provides a theoretical robustness guarantee for each.
2. **New Robustness Theorems**: For each word-level adversarial operation, the paper proposes a customized theorem that uses a noise distribution suited to that operation (staircase-shaped, uniform, Gaussian, or Bernoulli). These theorems yield the robustness bounds for each operation.
3. **Improved Training Toolkit**: To further improve certified accuracy and certified radius, the paper also proposes a set of optimization techniques, including anisotropic Gaussian noise to enlarge the certified radius.
4. **Extensive Experimental Verification**: The paper runs extensive experiments on multiple datasets (such as AG's News, Amazon, and IMDB) and two NLP models (LSTM and BERT) to verify the effectiveness of Text-CRS. According to the reported results, the average certified accuracy of Text-CRS under five representative adversarial attacks reaches 81.7%, which the summary gives as 64% higher than that of existing methods.
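The contributions above build on randomized smoothing, which can be sketched in its generic form as follows. This is not the Text-CRS algorithm or any of its theorems: it is the standard isotropic Gaussian case, where a smoothed classifier takes a majority vote over noisy copies of an input embedding and the certified $\ell_2$ radius is $\sigma \, \Phi^{-1}(p_A)$ for top-class probability $p_A$. The toy classifier, the parameter values, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import NormalDist
from typing import Callable, Sequence, Tuple


def toy_classifier(embedding: Sequence[float]) -> int:
    """Hypothetical base classifier: class 1 if the coordinates sum to > 0."""
    return 1 if sum(embedding) > 0 else 0


def smoothed_predict(
    classifier: Callable[[Sequence[float]], int],
    embedding: Sequence[float],
    sigma: float,
    n_samples: int,
    rng: random.Random,
) -> Tuple[int, float]:
    """Majority vote of the classifier over Gaussian-perturbed embeddings.

    Returns the winning label and its empirical vote share.
    """
    votes: Counter = Counter()
    for _ in range(n_samples):
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        votes[classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


def gaussian_l2_radius(p_top: float, sigma: float) -> float:
    """Certified L2 radius sigma * Phi^{-1}(p_top) for isotropic Gaussian noise.

    A rigorous certificate would pass a statistical lower confidence bound on
    p_top rather than the raw vote share used in this sketch.
    """
    if p_top <= 0.5:
        return 0.0  # no certificate when the top class lacks a majority
    p_top = min(p_top, 1.0 - 1e-6)  # inv_cdf is undefined at exactly 1.0
    return sigma * NormalDist().inv_cdf(p_top)


rng = random.Random(0)
label, p_top = smoothed_predict(toy_classifier, [2.0, 2.0, 2.0], 0.5, 500, rng)
print(label, p_top, gaussian_l2_radius(p_top, 0.5))
```

Swapping the Gaussian sampler for another noise distribution changes which perturbations the vote is stable against, which is the general idea behind choosing a different distribution per operation.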
### Summary
By introducing the Text-CRS framework, the paper addresses the limitations of existing defense methods and provides certified robustness with higher certified accuracy for text classification models. The framework also establishes a benchmark and a direction for future research on defending against multiple word-level adversarial operations.