Abstract: The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.

What problem does this paper attempt to address?

This paper addresses the detection of text generated by large language models (LLMs). As LLMs are increasingly deployed in malicious applications, such as generating false content, they threaten trust in online information. Modern LLMs are known to "hallucinate" (i.e., generate content that does not match reality), and their outputs may contain biases and artifacts inherited from the training data. If the Internet is flooded with millions of LLM-generated articles, how can readers trust the authenticity of what they read? It is also an open question whether future LLMs should be trained on text generated by current LLMs.

In response, researchers have developed a variety of techniques for detecting LLM-generated text. These techniques mainly rely on watermarking, outlier detection, trained classifiers, or retrieval-based methods. Among them, embedding a detectable signature (i.e., a watermark) into the model output is considered the most effective and robust. However, most existing watermarking algorithms require access to the logits of the underlying LLM (i.e., the raw scores behind the model's output probability distribution at each decoding step), which means they can only be implemented by the individual LLM API providers (such as OpenAI or Google). In addition, these methods become less effective when the LLM-generated text is modified by paraphrasing, translation, or cropping.

To address these problems, the paper proposes PostMark, a new post-hoc watermarking method. PostMark does not require access to the LLM's logits. Instead, it selects a set of words based on the semantics of the input text and uses a separate instruction-following LLM to insert them, embedding the watermark without significantly changing the meaning of the text. As a result, the method can be applied by third parties to the output of API providers (such as OpenAI), and it is also more robust to paraphrasing attacks. The paper verifies the effectiveness of PostMark through extensive experiments and examines the impact of watermarking on text quality, in particular the trade-offs in coherence, relevance, interestingness, and factual accuracy.
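The procedure described above can be illustrated with a rough sketch. It is not the authors' implementation: the hashed bag-of-words `toy_embed` is a hypothetical stand-in for a real semantic embedding model (it carries no actual semantics), `insert_words` merely appends the words where PostMark would ask an instruction-following LLM to weave them in, and the presence-based `detect` function with its `threshold` value is an assumption about how detection might work, since the summary does not spell it out.

```python
import hashlib
import math
import re

DIM = 32  # a SHA-256 digest has 32 bytes, one per toy embedding dimension


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a semantic embedder: hashed bag-of-words."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        digest = hashlib.sha256(token.encode()).digest()
        for i in range(DIM):
            vec[i] += digest[i] / 255.0 - 0.5
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_watermark_words(text: str, vocab: list[str], k: int) -> list[str]:
    """Pick the k vocabulary words whose embeddings are closest to the text."""
    text_vec = toy_embed(text)
    # Sort by descending similarity; break ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda w: (-cosine(text_vec, toy_embed(w)), w))
    return ranked[:k]


def insert_words(text: str, words: list[str]) -> str:
    """Placeholder for the LLM rewriting step: simply appends the words."""
    return text + " " + " ".join(words) + "."


def presence_score(text: str, words: list[str]) -> float:
    """Fraction of the given words that appear as tokens in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = sum(1 for w in words if w.lower() in tokens)
    return hits / len(words) if words else 0.0


def detect(text: str, vocab: list[str], k: int, threshold: float = 0.5) -> bool:
    """Assumed detector: recompute the word list from the candidate text and
    flag the text if enough of those words are present."""
    expected = select_watermark_words(text, vocab, k)
    return presence_score(text, expected) >= threshold
```

Because the word list is derived from the text itself, no logits from the generating model are needed at any point, which is what allows a third party to apply the watermark. With a real embedder, a paraphrase should map to a similar embedding and therefore a similar word list; the toy embedder here does not have that property.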

PostMark: A Robust Blackbox Watermark for Large Language Models
