Black-Box Detection of Language Model Watermarks

Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev
2024-07-13
Abstract: Watermarking has emerged as a promising way to detect LLM-generated text. To apply a watermark, an LLM provider, given a secret key, augments generations with a signal that is later detectable by any party with the same key. Recent work has proposed three main families of watermarking schemes, two of which focus on the property of preserving the LLM distribution. This is motivated by it being a tractable proxy for maintaining LLM capabilities, but also by the idea that concealing a watermark deployment makes it harder for malicious actors to hide misuse by avoiding a certain LLM or attacking its watermark. Yet, despite much discourse around detectability, no prior work has investigated if any of these scheme families are detectable in a realistic black-box setting. We tackle this for the first time, developing rigorous statistical tests to detect the presence of all three most popular watermarking scheme families using only a limited number of black-box queries. We experimentally confirm the effectiveness of our methods on a range of schemes and a diverse set of open-source models. Our findings indicate that current watermarking schemes are more detectable than previously believed, and that obscuring the fact that a watermark was deployed may not be a viable way for providers to protect against adversaries. We further apply our methods to test for watermark presence behind the most popular public APIs: GPT4, Claude 3, Gemini 1.0 Pro, finding no strong evidence of a watermark at this point in time.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
The main question this paper addresses is whether existing large language model (LLM) watermarking schemes can be detected in a realistic black-box setting. Specifically, it asks whether the three major families of watermarking schemes (Red-Green, Fixed-Sampling, and Cache-Augmented) can be detected in real-world deployments. Although detectability has been widely discussed, no prior work had systematically investigated it for these families in a black-box setting.

### Core Problems of the Paper

1. **Detectability of Watermarking Schemes**: For the first time, the paper develops rigorous statistical tests that detect the presence of each of the three scheme families using only a limited number of black-box queries. The tests check whether current schemes are as hard to detect as assumed. This matters because concealing a deployment is meant to make it harder for malicious actors to hide misuse by avoiding a watermarked LLM or attacking its watermark.
2. **Effectiveness in Practical Applications**: The paper confirms experimentally that the tests work on a range of schemes and a diverse set of open-source models. The results show that current watermarking schemes are more detectable than previously believed, which implies that obscuring the fact that a watermark is deployed may not be a viable way for providers to protect against adversaries.
3. **Watermark Detection in Public APIs**: The paper further applies the tests to the most popular public APIs (GPT-4, Claude 3, Gemini 1.0 Pro) to check whether a watermark is present. It finds no strong evidence of a watermark in these models at the time of the study.

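The summary above does not describe the tests themselves. As a rough illustration of what a black-box statistical test for watermark presence can look like, the following minimal sketch runs a permutation test on whether a model's choice between two words depends on a preceding context token. It is not the paper's procedure. The simulated model, the parity-based keyed partition, the `delta` bias, and all function names are assumptions made up for this example.

```python
import math
import random
from typing import Callable, List

# Hypothetical black-box interface: given a context token, the model returns
# 1 if it picked word "A" and 0 if it picked word "B".
Model = Callable[[str], int]


def make_toy_model(delta: float, key: str, seed: int) -> Model:
    """Simulated model (assumption). delta=0 means no watermark; delta>0 boosts
    "A" when (key, context) falls on the toy 'green' side, else suppresses it."""
    rng = random.Random(seed)

    def model(context: str) -> int:
        # Toy keyed partition, not the hash of any real watermarking scheme.
        green = sum(map(ord, key + context)) % 2 == 0
        logit = delta if green else -delta
        p_a = 1.0 / (1.0 + math.exp(-logit))
        return 1 if rng.random() < p_a else 0

    return model


def spread(groups: List[List[int]]) -> float:
    """Variance across contexts of the fraction of "A" answers."""
    fracs = [sum(g) / len(g) for g in groups]
    mean = sum(fracs) / len(fracs)
    return sum((f - mean) ** 2 for f in fracs) / len(fracs)


def context_dependence_pvalue(
    model: Model,
    contexts: List[str],
    n_samples: int = 50,
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Permutation test. Null hypothesis: answers do not depend on the context,
    so they are exchangeable across contexts."""
    groups = [[model(c) for _ in range(n_samples)] for c in contexts]
    observed = spread(groups)
    pooled = [x for g in groups for x in g]
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled = [pooled[i:i + n_samples] for i in range(0, len(pooled), n_samples)]
        if spread(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (1 + hits) / (1 + n_perm)


contexts = list("abcdefgh")
p_plain = context_dependence_pvalue(make_toy_model(0.0, "k", seed=1), contexts)
p_marked = context_dependence_pvalue(make_toy_model(2.0, "k", seed=1), contexts)
print(f"no watermark:  p = {p_plain:.3f}")
print(f"toy watermark: p = {p_marked:.3f}")
```

In this toy setup, a small p-value for the biased model reflects that its answers shift with the context token, which is the kind of key-dependent signal a black-box test can look for. A real model may depend on context for ordinary reasons, so a practical test would need prompts designed to rule that out.
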
### Main Contributions

- **Proposing New Detection Methods**: The paper proposes rigorous statistical tests for the Red-Green, Fixed-Sampling, and Cache-Augmented watermarking schemes.
- **Extensive Experimental Verification**: Experiments verify the effectiveness and robustness of the tests across multiple models and parameter settings.
- **Verification in Practical Applications**: Applying the tests to real black-box LLM deployments gives insight into the current state of watermark deployment.

### Conclusion

The paper's conclusions pose new challenges for the design and evaluation of future watermarking schemes. Although watermarking schemes can in theory preserve a model's capabilities, their detectability in practice deserves more attention. The paper suggests that future research should put more emphasis on other relevant properties of watermarking schemes, such as robustness to attacks, text quality, and efficiency, rather than on non-detectability alone.