Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models

Piotr Przybyła
2024-10-28
Abstract: We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, e.g. text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure until the victim classifier changes its decision. The evaluation confirms the superiority of our approach in the constrained scenario, especially in the case of long input text (news articles), where exhaustive search is not feasible.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper explores methods for generating adversarial examples (AEs) to test the robustness of text classification algorithms that detect low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. Specifically, the authors simulate real-world content moderation scenarios in which an attacker can make only a limited number of queries while trying to generate an adversarial example.

### Main Research Background

1. **Modern Machine Learning Methods**: Modern machine learning methods have proven effective in determining the credibility of text in various scenarios, helping to tackle the challenge of misinformation.
2. **Content Moderation Systems**: Many large platforms, especially social media, use text classifiers as part of their content moderation systems.
3. **Generation of Adversarial Examples**: The robustness of these solutions is often assessed by examining how easily adversarial examples can be generated. An adversarial example is a text sample that has been modified so that it retains its original meaning yet triggers an incorrect response from the classifier.
4. **Limitations of Existing Methods**:
   - Existing methods require a large number of queries to generate a single adversarial example, which is impractical in real-world applications.
   - Word replacement strategies may lead to loss of meaning, making the generated adversarial examples unusable.

### Proposed Method

The authors propose TREPAT (Tracing REcursive Paraphrasing for Adversarial examples from Transformers), a method that leverages large language models (LLMs) to generate adversarial examples. The specific steps are as follows:

1. **Segmenting Text**: The input text is divided into smaller segments so that each can be rewritten more reliably.
2. **Rewriting Text**: Variants of the text are generated using LLMs with different prompts (such as rewriting, synonym replacement, simplification and style transfer).
3. **Decomposing Changes**: The generated variants are decomposed into individual changes, which are then applied to the original text step by step through beam search until the classifier changes its decision.
4. **Applying Changes**: All changes from the variants are collected and gradually added to the original text to produce new text variants. If a new variant causes the classifier's response to change, it is returned as a successful adversarial example.

### Evaluation

The evaluation scenario simulates real-world interaction with content moderation systems, with the number of allowed queries matching the actual limits of commercial platforms. Experimental results show that TREPAT outperforms baseline and state-of-the-art (SOTA) methods in most scenarios, especially when dealing with longer input texts.

### Conclusion

TREPAT performs well at generating adversarial examples when the number of queries is limited, particularly for longer texts. Some limitations remain, such as the reluctance of LLMs to generate sensitive or vulgar content and the need for more manual effort in evaluating meaning preservation. Future work can further optimize these methods to improve their effectiveness in real-world applications.
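The "Decomposing Changes" and "Applying Changes" steps under Proposed Method can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the rephrasings are hard-coded rather than produced by an LLM, changes are extracted with a token-level diff from `difflib`, the victim is a hypothetical keyword-counting classifier, and the 0.5 decision threshold, beam width and query budget are arbitrary.

```python
import difflib
from typing import Callable, FrozenSet, List, Optional, Tuple

# A change replaces original tokens [start, end) with new tokens.
Change = Tuple[int, int, Tuple[str, ...]]


def decompose(original: List[str], rephrased: List[str]) -> List[Change]:
    """Split one rephrasing into small token-level changes (assumed diff-based)."""
    matcher = difflib.SequenceMatcher(a=original, b=rephrased, autojunk=False)
    return [(i1, i2, tuple(rephrased[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def conflicts(a: Change, b: Change) -> bool:
    """Two changes conflict if their spans overlap or start at the same position."""
    return a[0] == b[0] or not (a[1] <= b[0] or b[1] <= a[0])


def apply_changes(original: List[str], changes: FrozenSet[Change]) -> List[str]:
    """Apply non-conflicting changes right to left so earlier indices stay valid."""
    tokens = list(original)
    for start, end, new in sorted(changes, reverse=True):
        tokens[start:end] = new
    return tokens


def beam_search_attack(original: List[str], changes: List[Change],
                       score: Callable[[List[str]], float],
                       beam_width: int = 2,
                       max_queries: int = 20) -> Tuple[Optional[List[str]], int]:
    """Add changes one at a time, keeping the variants that lower the victim's score most.

    `score` returns the probability of the original class; the attack succeeds
    once it drops below 0.5. Every call to `score` counts as one query.
    """
    queries = 0
    beam: List[FrozenSet[Change]] = [frozenset()]
    seen = {frozenset()}
    while beam:
        candidates = []
        for applied in beam:
            for change in changes:
                if change in applied or any(conflicts(change, c) for c in applied):
                    continue
                new_set = applied | {change}
                if new_set in seen:
                    continue
                seen.add(new_set)
                if queries >= max_queries:
                    return None, queries  # query budget exhausted
                tokens = apply_changes(original, new_set)
                queries += 1
                p = score(tokens)
                if p < 0.5:
                    return tokens, queries  # victim changed its decision
                candidates.append((p, new_set))
        candidates.sort(key=lambda item: item[0])
        beam = [change_set for _, change_set in candidates[:beam_width]]
    return None, queries


# Hypothetical victim: flags text by counting trigger words.
TRIGGERS = {"shocking", "secret", "hate"}


def toy_score(tokens: List[str]) -> float:
    return sum(t in TRIGGERS for t in tokens) / len(TRIGGERS)


original = "shocking secret cure doctors hate revealed".split()
# Stand-ins for LLM rephrasings (hard-coded for the example).
rephrasings = [
    "surprising secret cure doctors dislike revealed".split(),
    "shocking hidden cure doctors hate revealed".split(),
]
changes = [c for r in rephrasings for c in decompose(original, r)]
adversarial, queries = beam_search_attack(original, changes, toy_score)
print(" ".join(adversarial) if adversarial else "no adversarial example", queries)
```

Because the search stops at the first variant that flips the decision, the resulting text contains only some of the changes suggested by the rephrasings, and the `max_queries` cap mirrors the limited-query setting the paper evaluates.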