Abstract: The advent of large language models (LLMs) has revolutionized the field of text generation, producing outputs that closely mimic human-like writing. Although academic and industrial institutions have developed detectors to prevent the malicious usage of LLM-generated texts, other research has cast doubt on the robustness of these systems. To stress test these detectors, we introduce a proxy-attack strategy that effortlessly compromises LLMs, causing them to produce outputs that align with human-written text and mislead detection systems. Our method attacks the source model by leveraging a reinforcement learning (RL) fine-tuned humanized small language model (SLM) in the decoding phase. Through an in-depth analysis, we demonstrate that our attack strategy is capable of generating responses that are indistinguishable to detectors, preventing them from differentiating between machine-generated and human-written text. We conduct systematic evaluations on extensive datasets using proxy-attacked open-source models, including Llama2-13B, Llama3-70B, and Mixtral-8x7B in both white- and black-box settings. Our findings show that the proxy-attack strategy effectively deceives the leading detectors, resulting in an average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 90.3% on a single dataset. Furthermore, in cross-discipline scenarios, our strategy also bypasses these detectors, leading to a significant relative decrease of up to 90.9%, while in the cross-language scenario, the drop reaches 91.3%. Despite our proxy-attack strategy successfully bypassing the detectors with such significant relative drops, we find that the generation quality of the attacked models remains preserved, even within a modest utility budget, when compared to the text produced by the original, unattacked source model.

What problem does this paper attempt to address?

The core problem this paper addresses is: **how to effectively attack detectors of text generated by large language models (LLMs), so that they can no longer distinguish machine-generated text from human-written text**. Specifically, the researchers propose a method named HUMPA (Humanized Proxy Attack), which uses a small language model (SLM) fine-tuned with reinforcement learning to attack the large language model during the decoding stage, shifting its output toward human-written text and thereby misleading existing detection systems.

### Specific Background of the Problem

With the development of large language models (such as ChatGPT and Llama), generated text has grown increasingly close to human-written text, raising concerns about misuse such as fake news, dissemination of malicious content, and plagiarism. To address these issues, academia and industry have developed a variety of detectors to identify AI-generated text. However, the reliability and robustness of these detectors have been questioned, especially their poor performance under attack.

### Main Contributions of the Paper

1. **A new attack strategy, HUMPA**: A small language model fine-tuned with reinforcement learning acts as a proxy attacker. Without directly fine-tuning the large language model, it changes the large model's output distribution to bring it closer to human-written text.
2. **Theoretical analysis and experimental verification**: The paper proves theoretically that fine-tuning a small language model can achieve an effect similar to directly attacking the large language model, and verifies the effectiveness of the method through extensive experiments.
3. **Deceiving detectors while maintaining generation quality**: The experimental results show that HUMPA significantly reduces detector performance (AUROC drops by 70.4% on average, with a maximum drop of 90.3%) while maintaining the quality of the generated text, and thus its usability in practical applications.

### Method Overview

- **Task Definition**: Given a set of prompts and responses, the goal is to find a new generation process \( M' \) such that the detector cannot distinguish text generated by \( M' \) from text written by humans.
- **Preference Reinforcement Learning**: Fine-tune the language model on a preference dataset \( D=\{(x, y_w, y_l)\} \), where \( y_w \) and \( y_l \) are responses generated by the reference model and \( y_w \) is preferred over \( y_l \).
- **Proxy Attack Mechanism**: Adjust the next-token output distribution of the large language model using the logit offset of the small language model. The formula is:

\[ \pi_{M'}(y_t|x, y_{<t})=\frac{1}{Z_{x,y_{<t}}}\pi_{\text{ref}}^M(y_t|x, y_{<t})\left(\frac{\pi^s_M(y_t|x, y_{<t})}{\pi_{\text{ref}}^s(y_t|x, y_{<t})}\right)^\alpha \]

where \( Z_{x,y_{<t}} \) is the normalization factor and \( \alpha \) is the attack ratio. Here \( \pi_{\text{ref}}^M \) is the large reference model, and the ratio compares the fine-tuned small model \( \pi^s_M \) with its own reference \( \pi_{\text{ref}}^s \).

### Experimental Results

The paper runs experiments in white-box and black-box settings on multiple datasets (such as OpenWebText, WritingPrompts, and PubMedQA) to verify the effectiveness of HUMPA. The results show that HUMPA significantly reduces detector performance across scenarios, including cross-discipline and cross-language ones, while maintaining the quality of the generated text.

### Summary

This paper proposes an attack strategy, HUMPA, that uses a small language model as a proxy attacker to deceive existing text detectors, demonstrating the vulnerability of current detection systems. The research provides an important reference for developing more robust detection methods in the future.
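
The proxy attack mechanism in the method overview can be sketched for a single decoding step as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy three-token vocabulary, and the assumption that all three models share one vocabulary are hypothetical. In log space the formula becomes the large model's log-probabilities plus `alpha` times the small model's log-probability offset, followed by renormalization (the role of \( Z_{x,y_{<t}} \)).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def proxy_attacked_distribution(large_ref_logits, small_tuned_logits,
                                small_ref_logits, alpha):
    """Next-token probabilities of the attacked model M' (hypothetical helper).

    Assumes all three logit vectors are indexed by the same vocabulary.
    """
    lp_large = log_softmax(large_ref_logits)    # log pi_ref^M
    lp_tuned = log_softmax(small_tuned_logits)  # log pi^s_M (fine-tuned SLM)
    lp_ref = log_softmax(small_ref_logits)      # log pi_ref^s (reference SLM)
    # Add alpha times the small model's log-probability offset.
    combined = [l + alpha * (t - r)
                for l, t, r in zip(lp_large, lp_tuned, lp_ref)]
    # Renormalizing is equivalent to dividing by Z in the formula.
    return [math.exp(v) for v in log_softmax(combined)]
```

With `alpha = 0` the function returns the large model's own distribution. Larger values of `alpha` push probability mass toward tokens that the fine-tuned small model prefers over its reference.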

Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors

Red Teaming Language Model Detectors with Language Models

Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack

Stumbling Blocks: Stress Testing the Robustness of Machine-Generated Text Detectors Under Attacks

Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

Universal and Transferable Adversarial Attacks on Aligned Language Models

PAL: Proxy-Guided Black-Box Attack on Large Language Models

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

RAFT: Realistic Attacks to Fool Text Detectors

Large Language Models can be Guided to Evade AI-Generated Text Detection

DROJ: A Prompt-Driven Attack against Large Language Models

Text Laundering: Mitigating Malicious Features Through Knowledge Distillation of Large Foundation Models

Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game

Are You Human? An Adversarial Benchmark to Expose LLMs

TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

ESPERANTO: Evaluating Synthesized Phrases to Enhance Robustness in AI Detection for Text Origination

Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models