Adversarial Text Purification: A Large Language Model Approach for Defense

Raha Moraffah, Shubh Khandelwal, Amrita Bhattacharjee, Huan Liu
2024-02-05
Abstract: Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks without knowing the type of attacks or training of the classifier. These techniques characterize and eliminate adversarial perturbations from the attacked inputs, aiming to restore purified samples that retain similarity to the initially attacked ones and are correctly classified by the classifier. Due to the inherent challenges associated with characterizing noise perturbations for discrete inputs, adversarial text purification has been relatively unexplored. In this paper, we investigate the effectiveness of adversarial purification methods in defending text classifiers. We propose a novel adversarial text purification that harnesses the generative capabilities of Large Language Models (LLMs) to purify adversarial text without the need to explicitly characterize the discrete noise perturbations. We utilize prompt engineering to exploit LLMs for recovering the purified examples for given adversarial examples such that they are semantically similar and correctly classified. Our proposed method demonstrates remarkable performance over various classifiers, improving their accuracy under the attack by over 65% on average.
Cryptography and Security, Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address the vulnerability of text classification models when faced with adversarial attacks. Specifically, the authors explore how to use large language models (LLMs) for adversarial text purification to defend against these attacks. The goal of adversarial text purification is to identify and eliminate adversarial perturbations from the attacked input, thereby restoring a purified sample that is similar to the original input and can be correctly classified.

### Background and Challenges

1. **Threat of Adversarial Attacks**:
   - Adversarial attacks cause text classification models to misclassify by adding small but carefully crafted perturbations to the input data.
   - These attacks pose a serious threat to the reliability and integrity of natural language processing (NLP) applications.
2. **Current State of Adversarial Purification**:
   - Adversarial purification is a defense mechanism that generates purified samples by removing adversarial perturbations from the input.
   - Although successful in the field of image classification, research on adversarial purification in text classification is relatively scarce due to the discrete nature of text inputs.

### Solution

1. **Utilizing Large Language Models**:
   - The authors propose an adversarial text purification method based on large language models (such as GPT-3.5).
   - Through prompt engineering, they leverage the generative capabilities and contextual understanding of LLMs to directly generate purified samples from adversarial texts without explicitly representing the perturbations.
2. **Experimental Validation**:
   - The authors conducted experiments on two commonly used NLP datasets (IMDb and AG News) to validate the effectiveness of the proposed method.
   - Experimental results show that the method significantly improves classification accuracy after adversarial attacks across various classifiers, with an average improvement of over 65%.

### Main Contributions

1. **Exploring the Feasibility of Adversarial Text Purification**:
   - Investigated whether text adversarial purification defenses can be effectively implemented.
2. **First Use of LLMs for Text Adversarial Purification**:
   - Proposed an effective text adversarial purification method by leveraging the contextual understanding and generative capabilities of LLMs.
3. **Extensive Experimental Validation**:
   - Conducted extensive experiments on two state-of-the-art Transformer-based text classifiers, demonstrating the effectiveness of the proposed method.

### Conclusion

The paper proposes a novel adversarial text purification method based on large language models, which can effectively remove adversarial perturbations and generate semantically similar and correctly classified purified samples without explicitly representing the perturbations. The method performs excellently across various classifiers, significantly improving classification accuracy after adversarial attacks, and opens new directions for future research in text adversarial defenses.
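
As a rough illustration of the prompt-based purification summarized above, the following Python sketch wraps a text classifier with an LLM rewriting step and compares accuracy on adversarial inputs before and after purification. Everything named here is an assumption made for illustration: the prompt wording in `PROMPT_TEMPLATE` is hypothetical (the paper's actual prompt is not reproduced in this summary), `llm` and `classifier` are plain callables standing in for a real model such as GPT-3.5 and a real Transformer classifier, and `toy_llm` and `toy_classifier` exist only so the sketch runs on its own.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical prompt wording, NOT the prompt used in the paper.
PROMPT_TEMPLATE = (
    "The following text may contain adversarial perturbations such as typos "
    "or word substitutions. Rewrite it so that it is clean and fluent while "
    "preserving its original meaning.\n\nText: {text}\n\nRewritten text:"
)


def build_prompt(text: str) -> str:
    """Embed the (possibly adversarial) input in the purification prompt."""
    return PROMPT_TEMPLATE.format(text=text)


def purify(text: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM for a purified rewrite; no perturbation model is needed."""
    return llm(build_prompt(text)).strip()


def accuracy(
    classifier: Callable[[str], int],
    texts: Sequence[str],
    labels: Sequence[int],
) -> float:
    """Fraction of texts whose predicted label matches the true label."""
    if not texts:
        raise ValueError("texts must be non-empty")
    correct = sum(classifier(t) == y for t, y in zip(texts, labels))
    return correct / len(texts)


def evaluate_defense(
    classifier: Callable[[str], int],
    llm: Callable[[str], str],
    adversarial_texts: Sequence[str],
    labels: Sequence[int],
) -> Tuple[float, float]:
    """Return (accuracy under attack, accuracy after purification)."""
    purified = [purify(t, llm) for t in adversarial_texts]
    return (
        accuracy(classifier, adversarial_texts, labels),
        accuracy(classifier, purified, labels),
    )


# --- Toy stand-ins so the sketch is runnable without any external service ---

def toy_llm(prompt: str) -> str:
    """Hypothetical 'LLM' that only undoes two hard-coded character swaps."""
    text = prompt.split("Text: ", 1)[1].rsplit("\n\nRewritten text:", 1)[0]
    return text.replace("gr8t", "great").replace("terrlble", "terrible")


def toy_classifier(text: str) -> int:
    """Hypothetical sentiment classifier: 1 if 'great' appears, else 0."""
    return 1 if "great" in text.lower() else 0


if __name__ == "__main__":
    adversarial = ["A gr8t movie", "A terrlble movie"]
    labels = [1, 0]
    print(evaluate_defense(toy_classifier, toy_llm, adversarial, labels))
```

In practice, `llm` would be replaced by a call to a hosted or local language model and `classifier` by the model under attack. The purification step treats the classifier as a black box, which matches the summary's point that the defense requires neither knowledge of the attack type nor retraining of the classifier.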