Abstract: Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks without knowing the type of attack or how the classifier was trained. These techniques characterize and eliminate adversarial perturbations from the attacked inputs, aiming to restore purified samples that remain similar to the attacked inputs and are correctly classified by the classifier. Because noise perturbations are inherently difficult to characterize for discrete inputs, adversarial text purification has been relatively unexplored. In this paper, we investigate the effectiveness of adversarial purification methods in defending text classifiers. We propose a novel adversarial text purification method that harnesses the generative capabilities of Large Language Models (LLMs) to purify adversarial text without the need to explicitly characterize the discrete noise perturbations. We use prompt engineering to have LLMs recover purified examples from given adversarial examples, such that the purified examples are semantically similar to the inputs and correctly classified. Our proposed method demonstrates remarkable performance over various classifiers, improving their accuracy under attack by over 65% on average.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the vulnerability of text classification models when faced with adversarial attacks. Specifically, the authors explore how to use large language models (LLMs) for adversarial text purification to defend against these attacks. The goal of adversarial text purification is to identify and eliminate adversarial perturbations from the attacked input, thereby restoring a purified sample that is similar to the original input and can be correctly classified.

### Background and Challenges

1. **Threat of Adversarial Attacks**:
   - Adversarial attacks cause text classification models to misclassify by adding small but carefully crafted perturbations to the input data.
   - These attacks pose a serious threat to the reliability and integrity of natural language processing (NLP) applications.
2. **Current State of Adversarial Purification**:
   - Adversarial purification is a defense mechanism that generates purified samples by removing adversarial perturbations from the input.
   - Although successful in the field of image classification, research on adversarial purification in text classification is relatively scarce due to the discrete nature of text inputs.

### Solution

1. **Utilizing Large Language Models**:
   - The authors propose an adversarial text purification method based on large language models (such as GPT-3.5).
   - Through prompt engineering, they leverage the generative capabilities and contextual understanding of LLMs to directly generate purified samples from adversarial texts without explicitly representing the perturbations.
2. **Experimental Validation**:
   - The authors conducted experiments on two commonly used NLP datasets (IMDb and AG News) to validate the effectiveness of the proposed method.
   - Experimental results show that the method significantly improves classification accuracy after adversarial attacks across various classifiers, with an average improvement of over 65%.

### Main Contributions

1. **Exploring the Feasibility of Adversarial Text Purification**:
   - Investigated whether text adversarial purification defenses can be effectively implemented.
2. **First Use of LLMs for Text Adversarial Purification**:
   - Proposed an effective text adversarial purification method by leveraging the contextual understanding and generative capabilities of LLMs.
3. **Extensive Experimental Validation**:
   - Conducted extensive experiments on two state-of-the-art Transformer-based text classifiers, demonstrating the effectiveness of the proposed method.

### Conclusion

The paper proposes a novel adversarial text purification method based on large language models, which can effectively remove adversarial perturbations and generate semantically similar and correctly classified purified samples without explicitly representing the perturbations. The method performs excellently across various classifiers, significantly improving classification accuracy after adversarial attacks, and opens new directions for future research in text adversarial defenses.
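The purification idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `purify`, `accuracy`), and the toy stand-ins for the LLM and the classifier are all assumptions made for illustration. In practice the `llm` callable would wrap a real model such as GPT-3.5, and the classifier would be a trained text classifier.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical prompt wording; the paper's actual prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "The following input may contain adversarial perturbations such as typos "
    "or word substitutions. Rewrite it so that the meaning is preserved and "
    "the perturbations are removed. Return only the rewritten input.\n\n"
    "Text: {text}"
)


def build_prompt(adversarial_text: str) -> str:
    """Wrap the (possibly attacked) input in the purification instruction."""
    return PROMPT_TEMPLATE.format(text=adversarial_text)


def purify(adversarial_text: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM for a purified version; `llm` maps a prompt to a completion."""
    return llm(build_prompt(adversarial_text)).strip()


def accuracy(
    classifier: Callable[[str], str],
    samples: List[Tuple[str, str]],
    llm: Optional[Callable[[str], str]] = None,
) -> float:
    """Fraction of samples classified correctly, optionally purifying first."""
    correct = 0
    for text, label in samples:
        if llm is not None:
            text = purify(text, llm)
        correct += int(classifier(text) == label)
    return correct / len(samples)


# Toy stand-ins (assumptions) so the sketch runs without any external service.
def toy_llm(prompt: str) -> str:
    """Pretend LLM: undoes two hard-coded character substitutions."""
    text = prompt.split("Text: ", 1)[1]
    return text.replace("gr3at", "great").replace("terr1ble", "terrible")


def toy_classifier(text: str) -> str:
    """Pretend sentiment classifier keyed on a single word."""
    return "positive" if "great" in text else "negative"


attacked_samples = [
    ("a gr3at movie", "positive"),
    ("a terr1ble movie", "negative"),
]

if __name__ == "__main__":
    print("accuracy under attack:", accuracy(toy_classifier, attacked_samples))
    print("accuracy after purification:",
          accuracy(toy_classifier, attacked_samples, llm=toy_llm))
```

Note that the classifier itself is left untouched: purification is applied only to the input, which matches the summary's description of a defense that requires no knowledge of the attack and no retraining of the classifier.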

Adversarial Text Purification: A Large Language Model Approach for Defense
