Abstract: Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require *white-box* access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., *black-box* access), boasts a *distortion-free* property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.

What problem does this paper attempt to address?

The problem this paper attempts to address is how to watermark text generated by a large language model (LLM) without white-box access to the model, so that it can later be detected whether a given text was generated by that LLM.

### Background and Motivation

- **Background**: It is important to know whether a piece of text was generated by a large language model, because such content may be unreliable and can exhibit "hallucination".
- **Limitations of existing methods**: Most existing watermarking schemes require white-box access to the model's next-token probability distribution, which is typically unavailable to downstream users of an LLM API. These methods are therefore not suitable for third-party users.
- **Research objective**: Propose a new watermarking scheme that requires only black-box access (i.e., the ability to sample sequences from the LLM), has a distortion-free property, and allows multiple secret keys to be chained or nested.

### Main Contributions

1. **Black-box watermarking scheme**: A new watermarking algorithm is proposed that requires only black-box access and adds a watermark to LLM-generated text without modifying the model weights or the training process.
2. **Performance guarantee**: Theoretical performance guarantees are provided, and experiments show that the scheme can outperform existing white-box schemes in certain scenarios.
3. **Multi-key nesting**: The scheme supports nested watermarking with multiple keys, which adds flexibility and security.

### Method Overview

- **Algorithm principle**: The scheme works autoregressively. At each step it samples multiple sequences from the LLM, scores each sequence using a secret key, and selects the highest-scoring sequence as the output. In this way, a watermark is added to the generated text without modifying the internal structure of the model.
- **Detection method**: The watermark is detected by computing the score of the text under the secret key. If the score is high, the text is likely to be watermarked.

### Experimental Results

- **Detection performance**: Experiments show that the scheme detects watermarks well on texts of different lengths, and performs especially well at low false positive rates (FPR).
- **Distortion-free property**: Experiments show that the scheme does not significantly affect the quality of the generated text in most cases.
- **Adversarial attacks**: The robustness of the scheme against two common attack strategies is studied, and the results show that the scheme resists these attacks to a certain extent.

### Conclusion

This paper proposes a new black-box watermarking scheme that makes it possible to detect whether a text was generated by a watermarked LLM without relying on white-box access. The scheme has a distortion-free property, supports multi-key nesting, and is suitable for third-party users, which makes it practical and secure.
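The sample-score-select loop and the score-based detection described in the method overview can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the summary does not specify the scoring function, so the sketch assumes that each n-gram is mapped to a pseudo-random value in [0, 1) with an HMAC under the secret key, and that a text's score is the mean of those values. The names `keyed_unit`, `score`, `detect`, `generate_watermarked`, and `make_toy_sampler`, the z-score threshold, and the uniform toy sampler standing in for an LLM are all hypothetical. The sketch also makes no attempt to reproduce the distortion-free guarantee or multi-key nesting.

```python
import hashlib
import hmac
import math
import random
from typing import Callable, List, Sequence


def keyed_unit(key: bytes, ngram: Sequence[int]) -> float:
    """Map an n-gram to a pseudo-random value in [0, 1) using the secret key."""
    msg = ",".join(str(t) for t in ngram).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def score(tokens: Sequence[int], key: bytes, n: int = 2) -> float:
    """Mean keyed value over the distinct n-grams of `tokens` (0.5 if none)."""
    grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if not grams:
        return 0.5
    return sum(keyed_unit(key, g) for g in grams) / len(grams)


def detect(tokens: Sequence[int], key: bytes, n: int = 2,
           threshold: float = 4.0) -> bool:
    """Flag text whose score is unusually high (assumed z-test vs. uniform)."""
    grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if not grams:
        return False
    # Under the no-watermark assumption the mean of N uniform values has
    # mean 0.5 and variance 1 / (12 N).
    z = (score(tokens, key, n) - 0.5) * math.sqrt(12 * len(grams))
    return z > threshold


def generate_watermarked(sample_chunk: Callable[[List[int]], List[int]],
                         key: bytes, num_candidates: int = 8,
                         num_steps: int = 40, n: int = 2) -> List[int]:
    """Black-box loop: sample candidates, keep the highest-scoring one."""
    out: List[int] = []
    for _ in range(num_steps):
        # Include the last n-1 tokens so n-grams spanning the boundary count.
        context = out[-(n - 1):] if n > 1 else []
        candidates = [sample_chunk(out) for _ in range(num_candidates)]
        best = max(candidates, key=lambda c: score(context + c, key, n))
        out.extend(best)
    return out


def make_toy_sampler(seed: int, vocab_size: int = 1000,
                     chunk_len: int = 4) -> Callable[[List[int]], List[int]]:
    """Hypothetical stand-in for an LLM: uniform random token chunks."""
    rng = random.Random(seed)

    def sample_chunk(prefix: List[int]) -> List[int]:
        return [rng.randrange(vocab_size) for _ in range(chunk_len)]

    return sample_chunk


key = b"hypothetical-secret-key"
watermarked = generate_watermarked(make_toy_sampler(0), key)
# With a single candidate there is nothing to select, so no watermark is added.
plain = generate_watermarked(make_toy_sampler(1), key, num_candidates=1)
```

Calling `detect(watermarked, key)` and `detect(plain, key)` then compares the two texts under the same key. Only `sample_chunk` touches the model, which is what makes the loop black-box: it needs sampled sequences, not next-token probabilities.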

A Watermark for Black-Box Language Models
