Learning to Rewrite: Generalized LLM-Generated Text Detection

Wei Hao, Ran Li, Weiliang Zhao, Junfeng Yang, Chengzhi Mao
2024-08-08
Abstract: Large language models (LLMs) can be abused at scale to create non-factual content and spread disinformation. Detecting LLM-generated content is essential to mitigate these risks, but current classifiers often fail to generalize in open-world contexts. Prior work shows that LLMs tend to rewrite LLM-generated content less frequently, which can be used for detection and naturally generalizes to unforeseen data. However, we find that the rewriting edit distance between human and LLM content can be indistinguishable across domains, leading to detection failures. We propose training an LLM to rewrite input text, producing minimal edits for LLM-generated content and more edits for human-written text, deriving a distinguishable and generalizable edit distance difference across different domains. Experiments on text from 21 independent domains and three popular LLMs (e.g., GPT-4o, Gemini, and Llama-3) show that our classifier outperforms the state-of-the-art zero-shot classifier by up to 20.6% on AUROC score and the rewriting classifier by 9.2% on F1 score. Our work suggests that LLMs can effectively detect machine-generated text if they are trained properly.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of detecting content generated by large language models (LLMs). Because LLMs can be misused at scale to create non-factual content and spread disinformation, reliable detectors are needed, yet current classifiers often fail to generalize in open-world settings. Earlier rewriting-based detection relies on the observation that an LLM edits LLM-generated text less than human-written text, but the authors find that the resulting edit distances for the two classes can be indistinguishable across domains. The paper proposes L2R (Learning to Rewrite), which trains an LLM to rewrite its input so that it makes more edits to human-written text and fewer edits to LLM-generated text, yielding an edit distance gap that is distinguishable and generalizes across domains. In experiments on text from 21 independent domains and three popular LLMs (GPT-4o, Gemini, and Llama-3), L2R outperforms the state-of-the-art zero-shot classifier by up to 20.6% on AUROC score and the rewriting classifier by 9.2% on F1 score. The authors conclude that LLMs can effectively detect machine-generated text if they are trained properly.
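The rewriting-based decision rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the character-level Levenshtein distance, the length normalization, the 0.1 threshold, and the names `rewrite_score` and `looks_llm_generated` are all assumptions made for illustration, and the `rewrite` argument is a hypothetical stand-in for a call to the (fine-tuned) rewriting LLM.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rewrite_score(text: str, rewrite: Callable[[str], str]) -> float:
    """Edit distance between the text and its rewrite, normalized to [0, 1].

    `rewrite` is a hypothetical placeholder for the rewriting model.
    """
    rewritten = rewrite(text)
    longest = max(len(text), len(rewritten), 1)  # avoid division by zero
    return levenshtein(text, rewritten) / longest


def looks_llm_generated(text: str,
                        rewrite: Callable[[str], str],
                        threshold: float = 0.1) -> bool:
    """Flag text as LLM-generated when the rewriter changes little of it.

    The threshold is an assumed value; in practice it would be tuned
    on held-out data.
    """
    return rewrite_score(text, rewrite) < threshold
```

In this sketch, a rewriter that returns its input unchanged produces a score of 0.0 and the text is flagged, while a rewriter that changes every character produces a score of 1.0 and the text is not flagged. Training the rewriter, as the paper proposes, aims to widen the gap between the scores of the two classes so that a single threshold works across domains.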