Distortion-free Watermarks are not Truly Distortion-free under Watermark Key Collisions

Yihan Wu, Ruibo Chen, Zhengmian Hu, Yanshuo Chen, Junfeng Guo, Hongyang Zhang, Heng Huang
2024-06-02
Abstract: Language model (LM) watermarking techniques inject a statistical signal into LM-generated content by substituting the random sampling process with pseudo-random sampling, using watermark keys as the random seed. Among these statistical watermarking approaches, distortion-free watermarks are particularly crucial because they embed watermarks into LM-generated content without compromising generation quality. However, one notable limitation of pseudo-random sampling compared to true-random sampling is that, under the same watermark keys (i.e., key collision), the results of pseudo-random sampling exhibit correlations. This limitation could potentially undermine the distortion-free property. Our studies reveal that key collisions are inevitable due to the limited availability of watermark keys, and existing distortion-free watermarks exhibit a significant distribution bias toward the original LM distribution in the presence of key collisions. Moreover, achieving a perfect distortion-free watermark is impossible as no statistical signal can be embedded under key collisions. To reduce the distribution bias caused by key collisions, we introduce a new family of distortion-free watermarks, the beta-watermark. Experimental results support that the beta-watermark can effectively reduce the distribution bias under key collisions.
Cryptography and Security, Machine Learning
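To make the key-collision effect described in the abstract concrete, here is a minimal Python sketch. It assumes a toy four-token next-token distribution `P` and a hypothetical keyed sampler `keyed_sample` that seeds a PRNG from the watermark key and context, then draws a token with the exponential-race (Gumbel-max) trick. This is one common style of distortion-free sampling, used here only for illustration; it is not the paper's code. With a fresh key per generation, token frequencies approach `P`; when the key collides, every generation returns the same token.

```python
import random
from collections import Counter

# Toy next-token distribution over a 4-token vocabulary (illustrative values).
P = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}


def keyed_sample(P, key, context=""):
    """Pseudo-random sampling of one token from P, seeded by (key, context).

    Hypothetical helper: a string seed makes random.Random deterministic, so
    the same key and context always reproduce the same draw.
    """
    rng = random.Random(f"{key}|{context}")
    # Exponential race (equivalent to the Gumbel-max trick): with E_t ~ Exp(1)
    # drawn independently, argmin_t E_t / P(t) is distributed exactly as P.
    return min(P, key=lambda t: rng.expovariate(1.0) / P[t])


N = 20_000

# No collisions: a fresh key per generation, so frequencies approach P.
fresh = Counter(keyed_sample(P, key=f"k{i}") for i in range(N))

# Key collision: every generation reuses key "k0" with the same context,
# so all N outputs are the same token.
collided = Counter(keyed_sample(P, key="k0") for _ in range(N))

print({t: round(fresh[t] / N, 3) for t in P})
print(collided)
```

Because `random.Random` hashes a string seed deterministically, the "random" choice is fully reproducible from the key. A deployed watermark would typically derive its randomness from a keyed cryptographic function instead, but the collision behavior is the same: identical key and context, identical output.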
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses a key question in language model (LM) watermarking: are distortion-free watermarks still distortion-free when watermark keys collide?

**Specific issues include:**

1. **Impact of Watermark Key Collisions**:
   - Pseudo-random sampling under the same watermark key produces correlated outputs; for the same prompt and key, repeated generations may even be identical, which undermines the distortion-free property of the watermark.
   - Because the number of available watermark keys is limited, key collisions are inevitable, which restricts the settings in which distortion-free watermarks can be relied on.
2. **Limitations of Existing Distortion-Free Watermarks**:
   - Under key collisions, existing distortion-free watermarks cannot fully preserve the original language model's distribution.
   - The authors show, theoretically and experimentally, that existing distortion-free watermarks exhibit significant distribution bias across multiple generations and are therefore not strongly distortion-free.
3. **Proposing a New Solution**:
   - To reduce the distribution bias caused by key collisions, the authors introduce a new family of distortion-free watermarks, the beta-watermark.
   - Experiments show that the beta-watermark effectively reduces distribution bias under key collisions; the authors also design a model-agnostic detection method for it.

### Main Contributions

1. **Definition of Three Distortion-Free Properties**:
   - **Step-wise distortion-free**: the watermark preserves the language model's distribution at a single token-generation step.
   - **Weakly distortion-free**: the watermark preserves the language model's distribution for a single generated sentence.
   - **Strongly distortion-free**: the watermark preserves the language model's distribution across multiple generated sentences.
   - The authors prove that existing distortion-free watermarks are weakly distortion-free but not strongly distortion-free.
2. **Theoretical Proof of a Trade-off under Key Collisions**:
   - Under key collisions, watermark strength trades off against distribution bias: a smaller distribution bias implies a weaker watermark.
   - In particular, a strongly distortion-free watermark has zero distribution bias under key collisions, which corresponds to zero watermark strength. Since no statistical signal can then be embedded, a strongly distortion-free watermark does not exist.
3. **Introduction of the Beta-Watermark**:
   - The beta-watermark is a new weakly distortion-free watermark that effectively reduces the distribution bias caused by key collisions.
   - The authors design a model-agnostic detection method that identifies the watermark without access to the prompt or the language model.
   - The effectiveness of the beta-watermark is validated experimentally on widely studied language models such as BART-large and LLaMA-2.

### Conclusion

Through theoretical analysis and experiments, this paper shows that existing distortion-free watermarks cannot be strongly distortion-free under key collisions. To address this, the authors propose the beta-watermark, which effectively reduces distribution bias under key collisions, together with a model-agnostic detection method. These contributions open new directions for language model watermarking.
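The gap between weakly and strongly distortion-free can be checked exactly on a toy example. The sketch below assumes the extreme case of a fully deterministic keyed sampler, so two generations that share a key and context always coincide; the distribution `P` and all helper names are hypothetical, and the paper's own schemes (including the beta-watermark) are not modeled here. Each single generation is still distributed as `P` (the weak notion holds), but the joint law of two colliding generations is far from the independent product that the strong notion would require.

```python
from itertools import product

# Toy next-token distribution (illustrative values, not from the paper).
P = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}


def joint_under_collision(P):
    """Joint law of two generations that reuse the same key and context.

    Assumes a fully deterministic keyed sampler: both generations equal one
    draw x ~ P, so all probability mass sits on the diagonal (x, x).
    """
    return {(s, t): (P[s] if s == t else 0.0) for s, t in product(P, repeat=2)}


def independent_joint(P):
    """What strong distortion-freeness requires: two independent draws from P."""
    return {(s, t): P[s] * P[t] for s, t in product(P, repeat=2)}


def total_variation(Q, R):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(Q[k] - R[k]) for k in Q)


J = joint_under_collision(P)

# Marginal of the first generation: identical to P, so each single
# generation is unbiased (weakly distortion-free).
marginal = {s: sum(J[(s, t)] for t in P) for s in P}

# Distance from the independent product: far from zero, so the strong
# notion fails under the collision.
tv = total_variation(J, independent_joint(P))

print(marginal)
print(round(tv, 4))  # 0.655
```

For this deterministic sampler the distance works out to `1 - sum_t P(t)^2`, so in this toy model the bias is largest when the next-token distribution is spread out (high entropy) and vanishes only when one token has probability 1.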