A Watermark for Low-entropy and Unbiased Generation in Large Language Models

Minjia Mao, Dongjun Wei, Zeyu Chen, Xiao Fang, Michael Chau
2024-10-16
Abstract: Recent advancements in large language models (LLMs) have highlighted the risk of misusing them, raising the need for accurate detection of LLM-generated content. In response, a viable solution is to inject imperceptible identifiers into LLMs, known as watermarks. Previous work demonstrates that unbiased watermarks ensure unforgeability and preserve text quality by maintaining the expectation of the LLM output probability distribution. However, previous unbiased watermarking methods suffer from one or more of the following issues: (1) requiring access to white-box LLMs during detection, (2) incurring long detection time, (3) being not robust against simple watermarking attacks, (4) failing to provide statistical guarantees for the type II error of watermark detection, and (5) being not statistically unbiased for low-entropy scenarios, which hinder their deployment in practice. This study proposes the Sampling One Then Accepting (STA-1) method, a watermark that can address all of these issues. Moreover, we discuss the tradeoff between watermark strength and text quality for unbiased watermarks. We show that in low-entropy scenarios, unbiased watermarks face a tradeoff between watermark strength and the risk of unsatisfactory outputs. Experimental results on both low-entropy and high-entropy datasets demonstrate that STA-1 achieves text quality and watermark strength comparable to existing unbiased watermarks, with a low risk of unsatisfactory outputs. Implementation codes for this study are available online.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to watermark the content generated by large language models (LLMs) so that it can be detected accurately, while preserving text quality and preventing forgery. Specifically, it targets five shortcomings of existing unbiased watermarking methods:

1. **Requiring access to white-box LLMs**: Existing unbiased watermarking methods usually require access to the internal structure of the model during detection, which may not be feasible in practical applications.
2. **Long detection time**: The detection process of some methods is time-consuming, which limits their efficiency in practice.
3. **Not robust**: Existing methods perform poorly against simple watermarking attacks and are easily bypassed.
4. **Lack of statistical guarantees**: Existing methods provide no statistical guarantee for the type II error of watermark detection (i.e., failing to detect watermarked content).
5. **Not unbiased in low-entropy scenarios**: In low-entropy scenarios, existing unbiased watermarking methods perform poorly and cannot maintain unbiasedness.

To solve these problems, the paper proposes a new watermarking method, Sampling One Then Accepting (STA-1), and evaluates it experimentally on low-entropy and high-entropy datasets. The main features of the STA-1 method are as follows:

- **Unbiasedness**: STA-1 keeps the expected output probability distribution the same before and after watermarking, which is what makes the watermark unbiased.
- **Efficient detection**: The detection time complexity of STA-1 is \(O(m)\), where \(m\) is the number of generated tokens.
- **Robustness**: STA-1 resists simple insertion and deletion attacks well.
- **Statistical guarantees**: STA-1 provides statistical guarantees for the type II error of watermark detection.

In addition, the paper discusses the trade-off between watermark strength and text quality for unbiased watermarks in low-entropy scenarios, and proposes an extension, Sampling M Then Accepting (STA-M). By repeating sampling in high-entropy steps, STA-M strengthens the watermark while trying to maintain text quality.

### Main contributions of the paper

1. **Proposing the STA-1 method**: A practical unbiased watermarking method with statistical guarantees.
2. **Clarifying the trade-off between watermark strength and text quality**: In low-entropy scenarios, unbiased watermarks still face a trade-off between strength and text quality.
3. **Experimental verification**: Results on public low-entropy and high-entropy datasets show that STA-1 performs comparably to other unbiased watermarking methods with a lower risk of unsatisfactory outputs. STA-M shows high watermark strength on low-entropy datasets and robustness to different watermarking attacks.
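
The summary above does not spell out the sampling or detection procedure, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a green-list scheme in which a secret key and the previous token decide whether a candidate token is "green", reads "Sampling One Then Accepting" as "sample once from the model distribution, keep the token if it is green, otherwise sample once more and keep that token", and detects by counting green tokens in a single pass. The names `is_green`, `sta1_step`, `detect_z`, the key string, and the green fraction `gamma` are all hypothetical.

```python
import hashlib
import math
import random


def is_green(key: str, prev_token: int, token: int, gamma: float) -> bool:
    """Keyed hash of (previous token, token) decides green-list membership.

    Assumption: each token is green with probability about gamma.
    """
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def sta1_step(probs, prev_token, key, gamma, rng):
    """Sample one token; accept it if green, otherwise resample once and accept."""
    vocab = range(len(probs))
    first = rng.choices(vocab, weights=probs, k=1)[0]
    if is_green(key, prev_token, first, gamma):
        return first
    # Second draw comes from the same unmodified distribution and is always kept.
    return rng.choices(vocab, weights=probs, k=1)[0]


def detect_z(tokens, key, gamma):
    """Single O(m) pass: count green tokens and return a z-score.

    Needs only the key and the token ids, not the model.
    """
    m = len(tokens) - 1  # the first token has no predecessor to seed the hash
    green = sum(is_green(key, p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * m) / math.sqrt(m * gamma * (1 - gamma))
```

Under these assumptions, unwatermarked text should contain roughly a `gamma` fraction of green tokens, so a large positive z-score suggests a watermark. Averaged over the random green list, the two-draw rule leaves each token's probability unchanged, which is the sense of unbiasedness described above. The sketch does not reproduce the paper's type II error guarantee or the STA-M extension.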