CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation

Tianlu Wang, Xuezhi Wang, Yao Qin, Ben Packer, Kang Li, Jilin Chen, Alex Beutel, Ed Chi
DOI: https://doi.org/10.48550/arXiv.2010.02338
2020-10-06
Abstract: NLP models are shown to suffer from robustness issues, i.e., a model's prediction can be easily changed under small perturbations to the input. In this work, we present a Controlled Adversarial Text Generation (CAT-Gen) model that, given an input text, generates adversarial texts through controllable attributes that are known to be invariant to task labels. For example, in order to attack a model for sentiment classification over product reviews, we can use the product categories as the controllable attribute which would not change the sentiment of the reviews. Experiments on real-world NLP datasets demonstrate that our method can generate more diverse and fluent adversarial texts, compared to many existing adversarial text generation approaches. We further use our generated adversarial examples to improve models through adversarial training, and we demonstrate that our generated attacks are more robust against model re-training and different model architectures.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the lack of robustness of natural language processing (NLP) models under small perturbations to the input text. Specifically, the authors point out that a model's prediction can change substantially when the input is only slightly modified. This vulnerability undermines the reliability of the models and limits their usefulness in practical applications.

To address this challenge, the paper proposes a method named "Controlled Adversarial Text Generation" (CAT-Gen). The core idea of CAT-Gen is to generate adversarial texts by varying controllable attributes that are known to be invariant to the task labels. For example, in a sentiment classification task over product reviews, adversarial samples can be generated by changing the product category (such as from electronic products to kitchenware) without changing the sentiment of the review. The method aims to produce more diverse and fluent adversarial texts, which can then be used to make the model more resistant to such attacks.

The main contributions of the paper include:

1. **Proposing the CAT-Gen model**: The model generates adversarial texts by controlling specific attributes that are unrelated to the labels of the main task but can still affect the model's predictions.
2. **Improving the quality of adversarial samples**: Compared with many existing adversarial text generation methods, the adversarial samples generated by CAT-Gen are more diverse and fluent.
3. **Enhancing the robustness of the model**: The generated adversarial samples are used for adversarial training to improve the model, and the generated attacks are shown to be more robust against model retraining and different model architectures.

Experiments on real-world NLP datasets support the effectiveness of CAT-Gen: the generated adversarial samples are more diverse and fluent than those of many existing approaches, and the attacks remain effective after the target model is retrained or replaced with a different architecture. This offers a new approach to improving the robustness of NLP models.
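
The attribute-controlled attack described above can be illustrated with a minimal Python sketch. It is not the paper's method: CAT-Gen uses a learned text generation model, whereas this sketch substitutes a simple word-level swap so the idea fits in a few lines. The lexicon `CATEGORY_SWAPS`, the classifier `toy_sentiment` with its hand-picked weights, and the helper `find_adversarial` are all hypothetical names and values introduced here for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical lexicon mapping electronics terms to kitchen terms (assumption,
# a stand-in for a learned controlled generator).
CATEGORY_SWAPS: Dict[str, str] = {
    "phone": "blender",
    "battery": "lid",
    "screen": "pan",
}


def swap_category(text: str, swaps: Dict[str, str]) -> str:
    """Replace category-specific words; all other words are left untouched."""
    return " ".join(swaps.get(tok, tok) for tok in text.lower().split())


def toy_sentiment(text: str) -> int:
    """Toy classifier (assumption) that wrongly relies on category words.

    Returns 1 for positive sentiment and 0 for negative sentiment.
    """
    weights = {
        "great": 2.0, "love": 2.0, "terrible": -2.0, "broke": -2.0,
        # Spurious category weights: these make the model attackable.
        "phone": 0.5, "blender": -2.5, "pan": -1.5,
    }
    score = sum(weights.get(tok, 0.0) for tok in text.lower().split())
    return 1 if score > 0 else 0


def find_adversarial(
    texts: List[str],
    labels: List[int],
    model: Callable[[str], int],
    swaps: Dict[str, str],
) -> List[Tuple[str, int]]:
    """Keep rewritten texts that flip a previously correct prediction.

    The label is carried over unchanged because the product category is
    assumed to be invariant to sentiment.
    """
    adversarial = []
    for text, label in zip(texts, labels):
        candidate = swap_category(text, swaps)
        if model(text) == label and model(candidate) != label:
            adversarial.append((candidate, label))
    return adversarial


reviews = ["great phone", "love the screen", "the battery broke"]
labels = [1, 1, 0]
adv = find_adversarial(reviews, labels, toy_sentiment, CATEGORY_SWAPS)
print(adv)
```

Because each returned pair keeps its original label, such pairs could be appended to the training data, which corresponds to the adversarial training step mentioned above.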