Baselines for Identifying Watermarked Large Language Models

Leonard Tang, Gavin Uberti, Tom Shlomi
2023-05-29
Abstract: We consider the emerging problem of identifying the presence and use of watermarking schemes in widely used, publicly hosted, closed source large language models (LLMs). We introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce distributions that diverge qualitatively and identifiably from standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to watermarking scenario. Along the way, we formalize the specific problem of identifying watermarks in LLMs, as well as LLM watermarks and watermark detection in general, providing a framework and foundations for studying them.
Machine Learning, Artificial Intelligence, Cryptography and Security, Computers and Society
### What problem does this paper attempt to solve?

This paper addresses the problem of identifying whether a large language model (LLM) is watermarked. Specifically, the authors focus on identifying watermarking schemes in widely used, publicly hosted, closed-source LLMs. A watermark is a hidden pattern embedded in generated text. Such patterns are invisible to humans but can be detected algorithmically to show that the text was generated by a specific AI system.

#### Background and Motivation

Large language models can now generate convincing human-like text, which has raised concerns that they may be used to spread false information, plagiarize, or maliciously impersonate others. Researchers have therefore begun to develop methods for detecting AI-generated text. One such method is watermarking: subtly modifying certain features of the output text so that it can be detected later. Existing research mainly asks whether a given text was generated by a watermarked model, whereas this paper asks whether a language model itself is watermarked.

The authors propose a set of baseline algorithms that identify watermarks by analyzing the distributions of output tokens and logits generated by watermarked and non-watermarked LLMs. These algorithms only need to query the model and do not need to know the underlying watermark parameters.

#### Main Contributions

1. **Introduction of Baseline Algorithms**: The authors propose several baseline algorithms, based on analyzing the distributions of output tokens and logits, for identifying watermarks in LLMs.
2. **Formal Problem Definition**: The paper formalizes the specific problem of identifying LLM watermarks, as well as LLM watermarks and watermark detection in general, providing a framework and foundation for studying them.
3. **Evaluation of Watermarks with Different Strengths**: The paper studies the identifiability of watermarks at varying strengths and discusses the trade-offs of each identification mechanism across watermarking scenarios.

#### Key Formulas

- **Lorenz Curve and Gini Coefficient**:

  \[ G=\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i - x_j|}{2n^2\bar{x}} \]

  where \(x_i\) and \(x_j\) are the probabilities of the \(i\)-th and \(j\)-th tokens in order, \(n\) is the number of tokens, and \(\bar{x}\) is the average probability. The Gini coefficient measures the degree of inequality in the distribution. A low value indicates a smoother distribution, suggesting the presence of a watermark.

- **Kolmogorov-Smirnov Statistic**:

  \[ D_{n,m}=\sup_x|F_{u,n}(x)-F_{w,m}(x)| \]

  Here \(F_{u,n}(x)\) and \(F_{w,m}(x)\) are the empirical distribution functions of the random numbers generated by the non-watermarked and watermarked models respectively, and \(n\) and \(m\) are the sample sizes.

Through these methods, the authors aim to provide a solid foundation for future research, especially for identifying watermarked LLMs without access to the underlying watermark parameters.
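
As a rough illustration of the two statistics above, the following Python sketch computes the Gini coefficient of a probability vector and the two-sample Kolmogorov-Smirnov statistic directly from their definitions. The function names `gini` and `ks_statistic` and the toy inputs are assumptions made for this example; they are not taken from the paper, and the sketch does not reproduce the authors' querying or identification procedure.

```python
from bisect import bisect_right


def gini(probs):
    """Gini coefficient G = sum_i sum_j |x_i - x_j| / (2 * n^2 * mean(x))."""
    n = len(probs)
    mean = sum(probs) / n
    # Sum of absolute differences over all ordered pairs (i, j).
    total = sum(abs(a - b) for a in probs for b in probs)
    return total / (2 * n * n * mean)


def ks_statistic(sample_u, sample_w):
    """Two-sample KS statistic D = sup_x |F_u(x) - F_w(x)|.

    Both empirical CDFs are right-continuous step functions, so the
    supremum is reached at one of the observed sample values.
    """
    u, w = sorted(sample_u), sorted(sample_w)
    n, m = len(u), len(w)
    d = 0.0
    for x in sorted(set(u) | set(w)):
        f_u = bisect_right(u, x) / n  # fraction of u-values <= x
        f_w = bisect_right(w, x) / m  # fraction of w-values <= x
        d = max(d, abs(f_u - f_w))
    return d


# Toy inputs (hypothetical, for illustration only).
uniform = [0.25, 0.25, 0.25, 0.25]   # perfectly even distribution
peaked = [1.0, 0.0, 0.0, 0.0]        # all mass on one token
print(gini(uniform), gini(peaked))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

The double loop in `gini` is quadratic in the vocabulary size, which is fine for a small example but would likely need a sorted, linear-time formulation for a full token vocabulary.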