Abstract: We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, e.g. text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure until the victim classifier changes its decision. The evaluation confirms the superiority of our approach in the constrained scenario, especially in the case of long input texts (news articles), where exhaustive search is not feasible.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to explore methods for generating Adversarial Examples (AEs) to test the robustness of text classification algorithms that detect low-credibility content, including propaganda, false statements, rumors, and extremely biased news. Specifically, the authors focus on simulating real-world content moderation scenarios where attackers are limited in the number of queries they can make when attempting to generate adversarial examples.

### Main Research Background

1. **Modern Machine Learning Methods**: Modern machine learning methods have proven effective in determining the credibility of text in various scenarios, helping to tackle the challenge of misinformation.
2. **Content Moderation Systems**: Many large platforms, especially social media, use text classifiers as part of their content moderation systems.
3. **Generation of Adversarial Examples**: To assess the robustness of these solutions, the ease of generating adversarial examples is often examined. Adversarial examples are text samples that have been modified but still retain their original meaning, yet can trigger incorrect responses from the classifier.
4. **Limitations of Existing Methods**:
   - Existing methods require a large number of queries to generate a single adversarial example, which is impractical in real-world applications.
   - Word replacement strategies may lead to loss of meaning, making the generated adversarial examples unusable.

### Proposed Method

The authors propose TREPAT (Tracing REcursive Paraphrasing for Adversarial examples from Transformers), a method that leverages large language models (LLMs) to generate adversarial examples. The specific steps are as follows:

1. **Segmenting Text**: The input text is divided into smaller segments to facilitate better rewriting.
2. **Rewriting Text**: Using LLMs and different prompts (such as rewriting, synonym replacement, simplification, style transformation, etc.), variants of the text are generated.
3. **Decomposing Changes**: The generated variants are decomposed into individual changes, which are then applied to the original text step-by-step through beam search until the classifier changes its decision.
4. **Applying Changes**: All changes from the variants are collected and gradually added to the original text to generate new text variants. If a new variant causes the classifier's response to change, a successful adversarial example is returned.

### Evaluation

The evaluation scenario simulates real-world interactions with content moderation systems, with the number of allowed queries conforming to the actual limits of commercial platforms. Experimental results show that TREPAT outperforms baseline and state-of-the-art (SOTA) methods in most scenarios, especially when dealing with longer input texts.

### Conclusion

The TREPAT method excels in generating adversarial examples under limited query scenarios, particularly when handling longer texts. However, there are still some limitations, such as the conservativeness of LLMs in generating sensitive or vulgar content, and the need for more manual effort in evaluating semantic retention. Future work can further optimize these methods to improve their effectiveness in real-world applications.
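The "Decomposing Changes" and "Applying Changes" steps of the proposed method can be illustrated with a small sketch. This is not the authors' implementation: it assumes the LLM rephrasings already exist and are passed in as `variants`, it uses a word-level diff from Python's `difflib` as one possible way to obtain small changes, and it treats the victim as a hypothetical function returning the probability of the originally predicted class, with 0.5 as an assumed decision threshold. The names `decompose`, `conflicts`, `apply_changes`, `beam_search_attack` and `toy_victim`, as well as the beam width and query budget, are illustrative only.

```python
import difflib
from typing import Callable, List, Optional, Sequence, Tuple

# A change replaces original tokens [start:end] with `new` (illustrative representation).
Change = Tuple[int, int, Tuple[str, ...]]


def decompose(original: str, variant: str) -> List[Change]:
    """Split one rephrasing into small word-level changes against the original."""
    a, b = original.split(), variant.split()
    ops = difflib.SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return [(i1, i2, tuple(b[j1:j2])) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def conflicts(c1: Change, c2: Change) -> bool:
    """Conservatively reject changes that start at the same token or overlap."""
    return c1[0] == c2[0] or (c1[0] < c2[1] and c2[0] < c1[1])


def apply_changes(original: str, changes: Sequence[Change]) -> str:
    """Apply non-conflicting changes; whitespace is normalised to single spaces."""
    tokens = original.split()
    # Apply right to left so that earlier token indices stay valid.
    for start, end, new in sorted(changes, reverse=True):
        tokens[start:end] = new
    return " ".join(tokens)


def beam_search_attack(
    original: str,
    variants: Sequence[str],
    victim: Callable[[str], float],  # assumed: P(original class | text)
    beam_width: int = 3,
    max_queries: int = 50,
) -> Tuple[Optional[str], int]:
    """Add changes one at a time until the victim flips or the query budget runs out."""
    pool = sorted({c for v in variants for c in decompose(original, v)})
    beam: List[Tuple[Change, ...]] = [()]
    queries = 0
    while beam and queries < max_queries:
        candidates, seen = [], set()
        for applied in beam:
            for change in pool:
                if any(conflicts(change, c) for c in applied):
                    continue
                state = tuple(sorted(applied + (change,)))
                if state in seen:
                    continue
                seen.add(state)
                if queries >= max_queries:
                    return None, queries  # budget exhausted without success
                text = apply_changes(original, state)
                score = victim(text)
                queries += 1
                if score < 0.5:  # assumed decision threshold
                    return text, queries
                candidates.append((state, score))
        # Keep the states that lowered the victim's confidence the most.
        beam = [s for s, _ in sorted(candidates, key=lambda x: x[1])[:beam_width]]
    return None, queries


def toy_victim(text: str) -> float:
    """Stand-in classifier: confidence grows with the number of trigger words."""
    triggers = {"shocking", "hoax"}
    return 0.3 + 0.3 * len(triggers & set(text.split()))


ORIGINAL = "shocking report reveals the vaccine is a hoax"
VARIANTS = [
    "surprising report reveals the vaccine is a hoax",
    "shocking report reveals the vaccine is a fabrication",
]
adversarial, used = beam_search_attack(ORIGINAL, VARIANTS, toy_victim, max_queries=10)
print(adversarial, used)
```

In this toy setup neither rephrasing flips the stand-in classifier on its own, so the search has to combine one change from each variant. The explicit `max_queries` counter mirrors the constrained scenario the paper focuses on: every call to the victim is counted, and the attack fails once the budget is spent.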

Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models

Verifying the Robustness of Automatic Credibility Assessment

Misleading Sentiment Analysis: Generating Adversarial Texts by the Ensemble Word Addition Algorithm

Rethinking Textual Adversarial Defense for Pre-trained Language Models

Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack

Exploring the Adversarial Capabilities of Large Language Models

Generating Natural Language Adversarial Examples Through Probability Weighted Word Saliency

Generating Natural Language Adversarial Examples on a Large Scale with Generative Models

Red Teaming Language Model Detectors with Language Models

Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion

OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation

Fake News Detectors are Biased against Texts Generated by Large Language Models

Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors

Shielding Google's language toxicity model against adversarial attacks

Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model

A Generative Adversarial Attack for Multilingual Text Classifiers

Detecting Textual Adversarial Examples Based on Distributional Characteristics of Data Representations

Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Generating Valid and Natural Adversarial Examples with Large Language Models

Adversarial Attacks on Large Language Model-Based System and Mitigating Strategies: A Case Study on ChatGPT

Reversible Jump Attack to Textual Classifiers with Modification Reduction