PostMark: A Robust Blackbox Watermark for Large Language Models

Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Wieting, Mohit Iyyer
2024-10-12
Abstract: The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.
Machine Learning, Artificial Intelligence, Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the detection of text generated by large language models (LLMs). LLMs are increasingly deployed in malicious applications, such as generating false content, which erodes trust in online text. Modern LLMs are known to "hallucinate" (that is, to generate content that does not match reality), and their outputs may carry biases and artifacts from their training data. If the Internet is flooded with millions of LLM-generated articles, how can readers trust the authenticity of what they read? It is also an open question whether future LLMs should be trained on text generated by current ones.
To address this problem, researchers have developed several techniques for detecting LLM-generated text, relying mainly on watermarking, outlier detection, trained classifiers, or retrieval-based methods. Among these, embedding a detectable signature (a watermark) into the model output is considered the most effective and robust. However, most existing watermarking algorithms require access to the underlying LLM's logits (the unnormalized scores that define the model's output probability distribution during decoding), so only the individual LLM API providers (such as OpenAI or Google) can implement them. In addition, these methods become less effective when the generated text is modified by paraphrasing, translation, or cropping. The paper therefore proposes PostMark, a post-hoc watermarking method designed to address both limitations. PostMark does not require access to the LLM's logits.
Instead, it uses a separate instruction-following LLM to insert a set of words, chosen according to the semantics of the input text, which embeds the watermark without significantly changing the text's meaning. Third parties can therefore apply the method to the output of API providers (such as OpenAI), and it is also more robust to paraphrasing attacks. The paper validates PostMark through extensive experiments and examines the watermark's impact on text quality, in particular the trade-offs involving coherence, relevance, interestingness, and factual accuracy.
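The post-hoc idea described above can be illustrated with a toy Python sketch. It is not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real semantic embedder, `naive_insert` is an assumed placeholder for the instruction-following LLM that would weave the words into the text, and `presence_score` is a simplified check assumed here for illustration. The vocabulary and all function names are invented for the example.

```python
import hashlib
import math
import re


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a semantic embedder: a hashed bag of words."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        sign = 1.0 if digest[1] % 2 == 0 else -1.0
        vec[digest[0] % dim] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Vectors from toy_embed are unit-length (or all zeros), so a dot product suffices.
    return sum(x * y for x, y in zip(a, b))


def select_watermark_words(text, vocabulary, k=3):
    """Pick the k vocabulary words whose toy embeddings are closest to the text's."""
    text_vec = toy_embed(text)
    # Sort by descending similarity; break ties alphabetically for determinism.
    ranked = sorted(vocabulary, key=lambda w: (-cosine(text_vec, toy_embed(w)), w))
    return ranked[:k]


def naive_insert(text, words):
    """Assumed placeholder for the LLM rewrite step: append the words verbatim."""
    return text.rstrip() + " Related terms: " + ", ".join(words) + "."


def presence_score(text, words):
    """Fraction of the watermark words that occur as whole tokens in the text."""
    if not words:
        return 0.0
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sum(1 for w in words if w.lower() in tokens) / len(words)


# Hypothetical vocabulary and input, chosen so no vocabulary word is in the input.
vocabulary = ["harbor", "lantern", "meadow", "orchard", "quartz", "violin"]
original = "The model produced a short summary of the quarterly report."

words = select_watermark_words(original, vocabulary, k=3)
marked = naive_insert(original, words)
print(words)
print(presence_score(original, words), presence_score(marked, words))
```

The word list depends only on the input text and the vocabulary, so nothing in the sketch touches the generating model's logits. Unlike a real semantic embedder, the toy embedder shifts noticeably once words are inserted, so the sketch passes the selected list to `presence_score` explicitly instead of recomputing it from the watermarked text.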