Watermarking Language Models for Many Adaptive Users

Aloni Cohen, Alexander Hoover, Gabe Schoenbach
2024-06-29
Abstract: We study watermarking schemes for language models with provable guarantees. As we show, prior works offer no robustness guarantees against adaptive prompting: when a user queries a language model more than once, as even benign users do. And with just a single exception (Christ and Gunn, 2024), prior works are restricted to zero-bit watermarking: machine-generated text can be detected as such, but no additional information can be extracted from the watermark. Unfortunately, merely detecting AI-generated text may not prevent future abuses. We introduce multi-user watermarks, which allow tracing model-generated text to individual users or to groups of colluding users, even in the face of adaptive prompting. We construct multi-user watermarking schemes from undetectable, adaptively robust, zero-bit watermarking schemes (and prove that the undetectable zero-bit scheme of Christ, Gunn, and Zamir (2024) is adaptively robust). Importantly, our scheme provides both zero-bit and multi-user assurances at the same time. It detects shorter snippets just as well as the original scheme, and traces longer excerpts to individuals. The main technical component is a construction of message-embedding watermarks from zero-bit watermarks. Ours is the first generic reduction between watermarking schemes for language models. A challenge for such reductions is the lack of a unified abstraction for robustness -- that marked text is detectable even after edits. We introduce a new unifying abstraction called AEB-robustness. AEB-robustness provides that the watermark is detectable whenever the edited text "approximates enough blocks" of model-generated output.
Cryptography and Security, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper targets two major deficiencies in existing language model watermarking schemes: **robustness to adaptive prompting** and **multi-user tracing**. Specifically:

1. **Robustness to adaptive prompting**: Existing watermarking schemes offer no robustness guarantees when a user queries the model more than once (adaptive prompting). Even benign users refine generated content over several interactions. Existing watermark definitions and proofs, however, cover only the single-query setting, so they do not ensure that the final text still carries a watermark after multiple queries.
2. **Multi-user tracing**: Merely detecting that a text is AI-generated may not be enough to prevent future abuse. In some cases it is necessary to trace text back to specific users or groups of users, especially when those users may collude to generate harmful content. With a single exception, existing schemes are zero-bit watermarks: they can detect AI-generated text but cannot extract any additional information, so they cannot support multi-user tracing.

To address these problems, the authors introduce the following:

- **Multi-user watermarking scheme**: Allows model-generated text to be traced back to individual users or to groups of colluding users, and remains robust even under adaptive prompting.
- **Construction of message-embedding watermarks from zero-bit watermarks**: Provides a generic method for building an L-bit watermarking scheme from an existing zero-bit watermarking scheme, so that additional information can be embedded in the generated text.
- **AEB-robustness framework**: Introduces a new unifying abstraction, AEB-robustness ("Approximates Enough Blocks"), for stating robustness guarantees: the watermark remains detectable whenever the edited text approximates enough blocks of model-generated output.

Through these innovations, the paper aims to provide a more robust and versatile language model watermarking scheme that can handle realistic usage patterns and potential security threats.

### Formula summary

- **AEB-robustness**: AEB stands for "Approximates Enough Blocks". A text is guaranteed to be detected as watermarked when it approximates a sufficient number of blocks of model-generated text.
- **δ-erasure ball**: \[ B_\delta(y)=\{z\in\{0,1,\bot\}^L : z_i=\bot\text{ for at most }\lfloor\delta L\rfloor\text{ indices }i,\text{ and otherwise }z_i=y_i\} \] This is the set of strings that agree with \(y\) everywhere except for at most \(\lfloor\delta L\rfloor\) positions replaced by the erasure symbol \(\bot\). It describes how much of an embedded message survives when up to a \(\delta\) fraction of its bits are erased.
- **Bounding the number of empty bins**: \[ k^*(L,\delta)=\min\left\{L\cdot(\ln L+\lambda),\; L\cdot\ln\left(\frac{1}{\delta-\sqrt{\frac{\lambda+\ln 2}{2L}}}\right)\right\} \] Here \(\lambda\) is the security parameter. This threshold on the number of balls thrown uniformly at random into \(L\) bins is used to bound the probability that more than \(\delta L\) bins remain empty.

Together, these definitions form the technical basis of the paper's robustness analysis.
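
As a rough illustration of the two formulas above, here is a minimal Python sketch. It is not code from the paper. The function names `in_erasure_ball` and `k_star` are hypothetical, `None` stands in for the erasure symbol ⊥, and the second term of \(k^*\) is read as \(L\cdot\ln\bigl(1/(\delta-\varepsilon)\bigr)\) with \(\varepsilon=\sqrt{(\lambda+\ln 2)/(2L)}\). Treating that term as unavailable when \(\delta\le\varepsilon\) is also an assumption of this sketch.

```python
import math
from typing import Optional, Sequence

ERASED = None  # stands in for the erasure symbol ⊥ (an assumption of this sketch)


def in_erasure_ball(z: Sequence[Optional[int]], y: Sequence[int], delta: float) -> bool:
    """Check whether z lies in the delta-erasure ball B_delta(y).

    z may contain 0, 1, or ERASED; y contains only 0 and 1.
    """
    L = len(y)
    if len(z) != L:
        return False
    # At most floor(delta * L) positions may be erased.
    erased = sum(1 for zi in z if zi is ERASED)
    if erased > math.floor(delta * L):
        return False
    # Every non-erased position must agree with y.
    return all(zi is ERASED or zi == yi for zi, yi in zip(z, y))


def k_star(L: int, delta: float, lam: float) -> float:
    """Evaluate k*(L, delta) under the reading described in the lead-in."""
    # First term: enough balls that (almost surely) no bin stays empty.
    first = L * (math.log(L) + lam)
    # Second term: only defined when delta exceeds the concentration slack.
    slack = delta - math.sqrt((lam + math.log(2)) / (2 * L))
    if slack <= 0:
        return first
    return min(first, L * math.log(1 / slack))


if __name__ == "__main__":
    y = [1, 0, 1, 1]
    # One erasure out of four positions is allowed when delta = 0.25.
    print(in_erasure_ball([1, ERASED, 1, 1], y, 0.25))
    # Tolerating half of the bins being empty needs far fewer balls than delta = 0.
    print(k_star(1000, 0.0, 10), k_star(1000, 0.5, 10))
```

With `delta = 0` the sketch falls back to the first term, `L * (ln L + λ)`. A larger `delta` lets the second term take over, so fewer balls are required.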