A Watermark for Black-Box Language Models

Dara Bahri, John Wieting, Dana Alon, Donald Metzler
2024-10-03
Abstract: Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require *white-box* access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., *black-box* access), boasts a *distortion-free* property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
Cryptography and Security, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to address is how to watermark the text generated by a large language model (LLM) without white-box access, so that it can later be detected whether a piece of text was generated by that specific LLM.

### Background and Motivation

- **Background**: It is important to know whether a piece of text was generated by a large language model, because such content may be unreliable and may contain "hallucinations".
- **Limitations of existing methods**: Most existing watermarking schemes require white-box access to the model's next-token probability distribution, which is typically unavailable to downstream users. These methods are therefore unsuitable for third-party users.
- **Research Objective**: Propose a new watermarking scheme that requires only black-box access (i.e., the ability to sample sequences from the LLM), has a distortion-free property, and allows multiple secret keys to be chained or nested.

### Main Contributions

1. **Black-box watermarking scheme**: A new watermarking algorithm is proposed that requires only black-box access and can watermark the text generated by an LLM without modifying the model weights or the training process.
2. **Performance guarantee**: Theoretical performance guarantees are provided, and experiments show that the scheme can outperform existing white-box schemes in certain scenarios.
3. **Multi-key nesting**: The scheme supports nested watermarking with multiple keys, enhancing flexibility and security.

### Method Overview

- **Algorithm principle**: The scheme proceeds autoregressively. At each step it samples multiple candidate sequences from the LLM, scores each candidate using a secret key, and outputs the highest-scoring one. In this way, a watermark is added to the generated text without touching the model's internals.
- **Detection method**: The watermark is detected by computing the score of the text under the same secret key. If the score is high, the text is likely to be watermarked.

### Experimental Results

- **Detection performance**: Experiments show that the scheme detects watermarks well on texts of different lengths, and performs especially well at low false positive rates (FPR).
- **Distortion-free property**: Experiments show that the scheme does not significantly affect the quality of the generated text in most cases.
- **Adversarial attacks**: The robustness of the scheme against two common attack strategies is studied, and the results show that the scheme resists these attacks to a certain extent.

### Conclusion

This paper proposes a new black-box watermarking scheme that embeds a detectable watermark in LLM-generated text without relying on white-box access. The scheme has a distortion-free property, supports multi-key nesting, and is suitable for third-party users who can only sample from the model.
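
The sample, score, and select loop from the method overview, together with score-based detection, can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not attempt to reproduce the paper's guarantees. Every name here is hypothetical, and the scoring rule (the mean of HMAC-derived pseudo-random values over distinct n-grams), the toy sampler that stands in for an LLM API, and the fixed detection threshold of 0.6 are all assumptions made for illustration.

```python
import hashlib
import hmac
import random
from typing import Callable, List, Sequence

Token = int
# A black-box sampler: given the text so far, return one sampled chunk of tokens.
Sampler = Callable[[Sequence[Token]], List[Token]]


def ngram_value(key: bytes, ngram: Sequence[Token]) -> float:
    """Keyed hash of an n-gram, mapped to a pseudo-random value in [0, 1)."""
    msg = ",".join(map(str, ngram)).encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def score(key: bytes, tokens: Sequence[Token], n: int = 2) -> float:
    """Assumed scoring rule: mean keyed value over the distinct n-grams."""
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if not ngrams:
        return 0.5  # no n-grams, so no evidence either way
    return sum(ngram_value(key, g) for g in ngrams) / len(ngrams)


def watermark_generate(sample_chunk: Sampler, key: bytes,
                       prompt: Sequence[Token], num_chunks: int,
                       m: int = 8, n: int = 2) -> List[Token]:
    """Autoregressively draw m candidate chunks and keep the top-scoring one."""
    out = list(prompt)
    for _ in range(num_chunks):
        candidates = [sample_chunk(out) for _ in range(m)]
        # Include the last n-1 tokens so n-grams spanning the boundary count.
        context = out[-(n - 1):] if n > 1 else []
        best = max(candidates, key=lambda c: score(key, context + c, n))
        out.extend(best)
    return out


def detect(key: bytes, tokens: Sequence[Token],
           threshold: float = 0.6, n: int = 2) -> bool:
    """Flag text whose score exceeds an (arbitrary, illustrative) threshold."""
    return score(key, tokens, n) > threshold


def make_toy_sampler(seed: int, vocab_size: int = 1000,
                     chunk_len: int = 4) -> Sampler:
    """Stand-in for an LLM API: returns uniformly random token chunks."""
    rng = random.Random(seed)
    return lambda prefix: [rng.randrange(vocab_size) for _ in range(chunk_len)]


key = b"hypothetical-secret-key"
sampler = make_toy_sampler(seed=0)
watermarked = watermark_generate(sampler, key, prompt=[0], num_chunks=50)
plain = [0] + [tok for _ in range(50) for tok in sampler([])]
print(score(key, watermarked), score(key, plain))
```

In this sketch, unwatermarked text scores close to 0.5 on average because each n-gram value behaves like a uniform draw, while picking the best of `m` candidates at every step pushes the score of watermarked text upward. A practical detector would replace the fixed threshold with a statistical test calibrated to a target false positive rate.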