Abstract: The advent of large language models (LLMs) has revolutionized the field of text generation, producing outputs that closely mimic human writing. Although academic and industrial institutions have developed detectors to prevent the malicious use of LLM-generated text, other research has cast doubt on the robustness of these systems. To stress test these detectors, we introduce a proxy-attack strategy that effortlessly compromises LLMs, causing them to produce outputs that align with human-written text and mislead detection systems. Our method attacks the source model in the decoding phase by leveraging a humanized small language model (SLM) fine-tuned with reinforcement learning (RL). Through an in-depth analysis, we demonstrate that our attack strategy generates responses that detectors cannot distinguish from human-written text. We conduct systematic evaluations on extensive datasets using proxy-attacked open-source models, including Llama2-13B, Llama3-70B, and Mixtral-8x7B, in both white-box and black-box settings. Our findings show that the proxy-attack strategy effectively deceives the leading detectors, resulting in an average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 90.3% on a single dataset. Furthermore, in cross-discipline scenarios, our strategy also bypasses these detectors, leading to a significant relative decrease of up to 90.9%, while in the cross-language scenario the drop reaches 91.3%. Although the proxy-attack strategy bypasses the detectors with such significant relative drops, the generation quality of the attacked models is preserved, even within a modest utility budget, when compared with the text produced by the original, unattacked source model.

Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors
