Edit Distance Robust Watermarks for Language Models

Noah Golowich, Ankur Moitra
2024-06-04
Abstract: Motivated by the problem of detecting AI-generated text, we consider the problem of watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn & Zamir (2024) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model's actual output distribution; and (b) robustness to channels which introduce a constant fraction of adversarial insertions, substitutions, and deletions to the watermarked text. Earlier schemes could only handle stochastic substitutions and deletions, and thus we are aiming for a more natural and appealing robustness guarantee that holds with respect to edit distance. Our main result is a watermarking scheme which achieves both undetectability and robustness to edits when the alphabet size for the language model is allowed to grow as a polynomial in the security parameter. To derive such a scheme, we follow an approach introduced by Christ & Gunn (2024), which proceeds via first constructing pseudorandom codes satisfying undetectability and robustness properties analogous to those above; our key idea is to handle adversarial insertions and deletions by interpreting the symbols as indices into the codeword, which we call indexing pseudorandom codes. Additionally, our codes rely on weaker computational assumptions than used in previous work. Then we show that there is a generic transformation from such codes over large alphabets to watermarking schemes for arbitrary language models.
Cryptography and Security, Artificial Intelligence, Machine Learning
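For orientation, property (a) of the abstract can be written out roughly as follows. This is a simplified sketch in the spirit of the definition of Christ, Gunn & Zamir (2024); the symbols $D$, $\mathsf{Model}$, $\mathsf{Wat}_{\mathsf{sk}}$, $\mathsf{Setup}$ and $\lambda$ are notation assumed here for illustration and may differ from the paper's own. A scheme is undetectable if, for every polynomial-time distinguisher $D$ with oracle access to either the plain model or the watermarked model,

$$
\left| \Pr\left[ D^{\mathsf{Model}}(1^{\lambda}) = 1 \right] - \Pr_{\mathsf{sk} \leftarrow \mathsf{Setup}(1^{\lambda})}\left[ D^{\mathsf{Wat}_{\mathsf{sk}}}(1^{\lambda}) = 1 \right] \right| \le \mathrm{negl}(\lambda),
$$

where $\lambda$ is the security parameter and $\mathrm{negl}$ denotes a function that vanishes faster than any inverse polynomial.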
What problem does this paper attempt to address?
This paper addresses how to embed watermarks with provable guarantees in text generated by language models. Specifically, the authors aim for a watermarking scheme that satisfies two key properties:

1. **Undetectability**: Following the cryptographic notion introduced in [CGZ24], the watermarked language model output is computationally indistinguishable from the model's actual output distribution. That is, no polynomial-time algorithm can tell the two apart, even when it may query the watermarked model or the actual model many times.
2. **Robustness to Edits**: The watermark must survive channels that introduce a constant fraction of adversarial insertions, substitutions, and deletions. Earlier schemes could only handle random substitutions and deletions, whereas this paper targets a more natural and appealing robustness guarantee stated in terms of edit distance.

### Main Contributions

The main contribution of the paper is a watermarking scheme that achieves both undetectability and robustness to edits when the alphabet size of the language model is allowed to grow polynomially in the security parameter. To achieve this, the authors proceed in two parts:

1. **Pseudorandom Codes (PRCs)**: First, construct pseudorandom codes that satisfy analogous undetectability and robustness properties. The symbols of these codes are interpreted as indices into the codeword, which is why they are called indexing pseudorandom codes.
2. **Conversion from a Large Alphabet to an Arbitrary Language Model**: Show how to convert such codes over a large alphabet into a watermarking scheme applicable to any language model.

### Technical Roadmap

The technical route of the paper is as follows:

1. **Design Pseudorandom Codes on a Binary Alphabet**: Under Assumption 3.1, design a pseudorandom code over a binary alphabet that is robust to a constant fraction of adversarial substitutions (Theorem 3.2).
2. **Generic Conversion**: Give a generic transformation that turns a substitution-robust pseudorandom code over a binary alphabet into a pseudorandom code over a polynomial-size alphabet that is robust to a constant fraction of substitutions, insertions, and deletions (Theorem 4.1).
3. **Conversion from Pseudorandom Codes to a Watermarking Scheme**: Give a generic transformation that turns a pseudorandom code over a large alphabet that is robust to substitutions, insertions, and deletions into a watermarking scheme with similar robustness (Theorem 5.1).

### Main Results

The main result of the paper (the informal version of Theorem 1.1) shows that, under standard cryptographic assumptions, there exists a watermarking scheme with the following properties for model outputs of polynomial length over a polynomial-size alphabet:

- Soundness
- Undetectability against polynomial-time algorithms
- Robustness to a constant fraction of substitutions, insertions, and deletions

### Related Work

- **Early Watermarking Work**: Early work mainly embedded watermarks by subtly altering the text, but these methods usually change the distribution of the generated content significantly.
- **Recent Work**: Some recent work embeds watermarks by preferentially generating certain tokens during generation, but these methods may introduce noticeable biases.
- **Undetectability vs. Distortion-Freeness**: Undetectability requires that no polynomial-time algorithm can distinguish the outputs of the watermarked model from those of the original model, while distortion-freeness requires only that the distribution of a single text sample is exactly the same. Undetectability provides a stronger guarantee, since it must hold across many queries rather than for a single sample, and is more suitable for practical applications.

In conclusion, the paper addresses the problem of embedding watermarks with provable robustness in text generated by language models by introducing indexing pseudorandom codes and generic transformations from such codes to watermarking schemes.
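
The indexing idea from the Main Contributions can be illustrated with a small Python sketch. It is a simplified toy under stated assumptions, not the paper's actual construction: the helper names `index_encode`, `index_decode`, and `hamming` are hypothetical, the binary codeword is just random bits rather than a pseudorandom code, and the alphabet is taken to be the set of positions `0..n-1`. The point it shows is why symbols that act as indices turn edits on the symbol string into a bounded number of bit flips: the decoder only asks which indices appear, so an insertion or deletion changes at most one bit and a substitution at most two.

```python
import random

def index_encode(bits, rng):
    # Symbols are the positions of the 1-bits, emitted in a random order.
    support = [i for i, b in enumerate(bits) if b == 1]
    rng.shuffle(support)
    return support

def index_decode(symbols, n):
    # Bit i is 1 iff symbol i occurs somewhere in the received string.
    bits = [0] * n
    for s in symbols:
        if 0 <= s < n:
            bits[s] = 1
    return bits

def hamming(a, b):
    # Number of positions where two equal-length bit vectors differ.
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
n = 32
codeword = [rng.randint(0, 1) for _ in range(n)]  # stand-in for a binary codeword
sent = index_encode(codeword, rng)

# Apply three edits to the symbol string: one deletion, one insertion,
# and one substitution.
received = list(sent)
if received:
    del received[0]                        # deletion: at most 1 bit flips
received.insert(0, rng.randrange(n))       # insertion: at most 1 bit flips
received[-1] = rng.randrange(n)            # substitution: at most 2 bits flip

recovered = index_decode(received, n)
print("bit flips after 3 edits:", hamming(codeword, recovered))
```

In this toy, the recovered vector differs from the original codeword in at most four positions, so a binary code that tolerates a constant fraction of substitutions would absorb the damage. The position of a symbol within the string carries no information, which is why shifting caused by insertions and deletions does no harm here.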