Abstract: NLP models are shown to suffer from robustness issues, i.e., a model's prediction can be easily changed under small perturbations to the input. In this work, we present a Controlled Adversarial Text Generation (CAT-Gen) model that, given an input text, generates adversarial texts through controllable attributes that are known to be invariant to task labels. For example, in order to attack a model for sentiment classification over product reviews, we can use the product categories as the controllable attribute which would not change the sentiment of the reviews. Experiments on real-world NLP datasets demonstrate that our method can generate more diverse and fluent adversarial texts, compared to many existing adversarial text generation approaches. We further use our generated adversarial examples to improve models through adversarial training, and we demonstrate that our generated attacks are more robust against model re-training and different model architectures.

What problem does this paper attempt to address?

This paper addresses the lack of robustness in natural language processing (NLP) models: a model's prediction can change substantially when the input text is only slightly perturbed. This vulnerability undermines the reliability of the models and limits their usefulness in practical applications.

To address this, the paper proposes Controlled Adversarial Text Generation (CAT-Gen). The core idea is to generate adversarial texts by varying controllable attributes that are known to be invariant to the task labels. For example, in sentiment classification over product reviews, adversarial examples can be generated by changing the product category (such as from electronic products to kitchenware) without changing the sentiment of the review. The method aims to produce more diverse and fluent adversarial texts than existing approaches, which can then be used to make models more robust.

The main contributions of the paper are:

1. **Proposing the CAT-Gen model**: Given an input text, the model generates adversarial texts by controlling attributes that are unrelated to the label of the main task but can still change the target model's prediction.
2. **Improving the quality of adversarial examples**: Compared with many existing adversarial text generation methods, the texts generated by CAT-Gen are more diverse and fluent.
3. **Improving models through adversarial training**: The generated adversarial examples are used for adversarial training, and the generated attacks are shown to be more robust against model re-training and different model architectures.

Experiments on real-world NLP datasets support these claims. The work offers a new approach to improving the robustness of NLP models.
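The attribute-controlled attack idea summarized above can be illustrated with a small, self-contained sketch. This is not the CAT-Gen model itself: the paper describes a generative model, whereas the code below swaps in a plain word-substitution table as a stand-in for the generator. The lexicon `CATEGORY_SWAPS`, the deliberately brittle `toy_classifier`, and the helper names are all hypothetical and exist only to show what "change the product category, keep the sentiment, see whether the prediction flips" means in practice.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical lexicon mapping electronics words to kitchenware counterparts.
# The real method uses a learned generator; this table is only a stand-in.
CATEGORY_SWAPS: Dict[str, str] = {
    "phone": "blender",
    "battery": "motor",
    "screen": "lid",
}


def swap_category(text: str, swaps: Dict[str, str]) -> str:
    """Rewrite category-specific words and leave all other tokens untouched."""
    return " ".join(swaps.get(tok, tok) for tok in text.split())


def toy_classifier(text: str) -> int:
    """Brittle stand-in sentiment model (1 = positive, 0 = negative).

    It relies partly on a spurious cue: the word 'phone' nudges the score
    upward even though product category should not affect sentiment.
    """
    tokens = text.split()
    score = 2 * tokens.count("great") - 2 * tokens.count("terrible")
    score += tokens.count("phone")  # spurious category cue
    return 1 if score > 0 else 0


def find_adversarial(
    examples: List[Tuple[str, int]],
    classifier: Callable[[str], int],
    swaps: Dict[str, str],
) -> List[Tuple[str, str, int]]:
    """Collect (original, rewritten, label) triples where the classifier is
    correct on the original text but wrong after the category rewrite."""
    found = []
    for text, label in examples:
        rewritten = swap_category(text, swaps)
        changed = rewritten != text
        if changed and classifier(text) == label and classifier(rewritten) != label:
            found.append((text, rewritten, label))
    return found


EXAMPLES: List[Tuple[str, int]] = [
    ("great phone", 1),
    ("the phone works fine", 1),
    ("terrible battery", 0),
]

if __name__ == "__main__":
    for original, rewritten, label in find_adversarial(
        EXAMPLES, toy_classifier, CATEGORY_SWAPS
    ):
        print(f"label={label}: {original!r} -> {rewritten!r}")
```

In this toy setting, a review whose positive prediction depends only on the category word loses that prediction once the category is swapped, while reviews with explicit sentiment words are unaffected. Appending the collected rewritten texts with their original labels to the training set would be the simplest form of the adversarial training step mentioned in the summary, again under the assumptions stated here.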

CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation
