Towards Better Statistical Understanding of Watermarking LLMs

Zhongze Cai, Shang Liu, Hanzhao Wang, Huaiyang Zhong, Xiaocheng Li
2024-03-19
Abstract: In this paper, we study the problem of watermarking large language models (LLMs). We consider the trade-off between model distortion and detection ability and formulate it as a constrained optimization problem based on the green-red algorithm of Kirchenbauer et al. (2023a). We show that the optimal solution to the optimization problem enjoys a nice analytical property which provides a better understanding and inspires the algorithm design for the watermarking process. We develop an online dual gradient ascent watermarking algorithm in light of this optimization formulation and prove its asymptotic Pareto optimality between model distortion and detection ability. Such a result guarantees an averaged increased green list probability and henceforth detection ability explicitly (in contrast to previous results). Moreover, we provide a systematic discussion on the choice of the model distortion metrics for the watermarking problem. We justify our choice of KL divergence and present issues with the existing criteria of "distortion-free" and perplexity. Finally, we empirically evaluate our algorithms on extensive datasets against benchmark algorithms.
Machine Learning, Cryptography and Security, Information Theory
What problem does this paper attempt to address?
This paper studies how to watermark large language models (LLMs) so that watermarked text is easier to detect without significantly distorting the model's output distribution. It formalizes this challenge as a constrained optimization problem built on the green-red algorithm framework of Kirchenbauer et al. (2023a), explores the trade-off between model distortion and detection ability, and proposes an online dual gradient ascent watermarking algorithm to solve the problem.

### Main research questions:

1. **Trade-off between model distortion and detection ability**: Model distortion is measured by KL divergence, and detection ability by the increase in green-list probability. With these two quantities, the paper casts the core problem of watermarking as a constrained optimization problem: minimize the distortion of the original model subject to a required level of detection ability.
2. **Design and analysis of the optimization algorithm**: To solve this problem, the paper develops an online dual gradient ascent watermarking algorithm and proves its asymptotic Pareto optimality between model distortion and detection ability. In other words, asymptotically, detection ability cannot be improved further without increasing distortion, and distortion cannot be reduced without giving up detection ability.
3. **Selection of model distortion metrics**: The paper systematically discusses how to choose the distortion metric and why it settles on KL divergence. The authors point out issues with the existing "distortion-free" criterion and with perplexity, and justify KL divergence through theoretical analysis and experiments.
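
The trade-off in question 1 can be made concrete with a toy calculation. The sketch below is illustrative only: the six-token distribution `p`, the green list `green`, and the helper names `tilt_green`, `kl_divergence`, and `green_gain` are all hypothetical and do not come from the paper. It assumes a Kirchenbauer-style watermark that adds the same shift `delta` to the logit of every green token, then reports the resulting KL divergence from the original distribution and the gain in green-list probability.

```python
import math

def tilt_green(p, green, delta):
    """Scale green-token probabilities by exp(delta), then renormalize.

    This is equivalent to adding `delta` to the logits of green tokens.
    """
    weights = [pk * math.exp(delta) if k in green else pk
               for k, pk in enumerate(p)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions; zero-mass terms contribute 0."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

def green_gain(q, p, green):
    """Green-list probability under q minus green-list probability under p."""
    return sum(q[k] - p[k] for k in green)

# Toy next-token distribution over a 6-token vocabulary (made-up numbers).
p = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
green = {1, 3, 5}  # hypothetical green list

for delta in (0.0, 0.5, 1.0, 2.0):
    q = tilt_green(p, green, delta)
    print(f"delta={delta:.1f}  "
          f"KL={kl_divergence(q, p):.4f}  "
          f"green gain={green_gain(q, p, green):.4f}")
```

Both quantities are zero at `delta = 0` and grow together as `delta` increases, which is the tension the constrained formulation is meant to manage.
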
### Main contributions of the paper:

- **Formalization of the optimization problem**: The paper formalizes the trade-off between model distortion and detection ability in watermarking as a constrained optimization problem and provides detailed mathematical derivations.
- **Online dual gradient ascent algorithm**: The paper proposes an online dual gradient ascent watermarking algorithm that is proven to be asymptotically Pareto optimal and is evaluated empirically against benchmark algorithms.
- **Justification of the distortion metric**: The paper justifies KL divergence as the model distortion metric through theoretical analysis and experiments, providing a theoretical basis for future research.

### Mathematical formulas:

- **KL divergence**: Measures the difference between two distributions \( Q \) and \( P \):

  \[ D_{\text{KL}}(Q \| P) = \int \log \left( \frac{dQ}{dP} \right) dQ \]

- **Difference in green-word probabilities (DG)**: Measures the change in green-word probability between the watermarked model \( q \) and the original model \( p \) at the \( t \)-th time step, where \( V \) is the vocabulary:

  \[ \text{DG}_t(q_t) = \sum_{k \in V,\, k \in \text{green}} q_{t,k} - \sum_{k \in V,\, k \in \text{green}} p_{t,k} \]

- **Optimization problem**: The main optimization problem is written in terms of decision variables \( \delta_{t,k} \), the per-token logit perturbations that turn \( p_t \) into \( q_t \):

  \[ \text{OPT}(\Delta) = \min_{\delta_{t,k}} \frac{1}{T} \sum_{t = 1}^T D_{\text{KL},t}(\delta_{t,1}, \ldots, \delta_{t,|V|}) \]

  subject to the constraint

  \[ \frac{1}{T} \sum_{t = 1}^T \text{DG}_t(\delta_{t,1}, \ldots, \delta_{t,|V|}) \geq \Delta \]

Through these contributions, the paper provides a theoretical and practical basis for understanding and designing more effective language model watermarking techniques.
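
The constrained problem above lends itself to a dual ascent scheme, sketched below under stated assumptions; this is not the authors' exact algorithm. Attaching a multiplier `lam` to the green-probability constraint and minimizing the per-step Lagrangian over distributions gives a closed form in which every green token's probability is scaled by `exp(lam)` before renormalizing. The sketch applies that step and then raises `lam` when the realized green gain falls short of the target \( \Delta \) and lowers it otherwise. The synthetic distributions, the random green lists, the step size, and the names `run_dual_ascent`, `tilt_green`, `kl_divergence`, and `green_gain` are hypothetical choices made for illustration.

```python
import math
import random

def tilt_green(p, green, lam):
    """Per-step Lagrangian minimizer: scale green probabilities by exp(lam)."""
    weights = [pk * math.exp(lam) if k in green else pk
               for k, pk in enumerate(p)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions; zero-mass terms contribute 0."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

def green_gain(q, p, green):
    """DG_t: green-list probability under q minus that under p."""
    return sum(q[k] - p[k] for k in green)

def run_dual_ascent(num_steps=2000, vocab_size=20, target_gain=0.2,
                    step_size=0.5, seed=0):
    """Simulate online dual ascent on synthetic next-token distributions.

    Returns (average KL, average green gain, final dual variable).
    """
    rng = random.Random(seed)
    lam = 0.0  # dual variable for the constraint "average DG >= target_gain"
    total_kl = 0.0
    total_dg = 0.0
    for _ in range(num_steps):
        # Synthetic stand-in for the model's next-token distribution p_t.
        raw = [rng.expovariate(1.0) for _ in range(vocab_size)]
        norm = sum(raw)
        p = [r / norm for r in raw]
        # Synthetic stand-in for the pseudo-random green list at step t.
        green = set(rng.sample(range(vocab_size), vocab_size // 2))

        q = tilt_green(p, green, lam)
        dg = green_gain(q, p, green)
        total_kl += kl_divergence(q, p)
        total_dg += dg

        # Projected dual ascent: grow lam when the constraint is under-served.
        lam = max(0.0, lam + step_size * (target_gain - dg))
    return total_kl / num_steps, total_dg / num_steps, lam

avg_kl, avg_dg, final_lam = run_dual_ascent()
print(f"average KL = {avg_kl:.4f}, average DG = {avg_dg:.4f}, "
      f"final lambda = {final_lam:.4f}")
```

In this simulation the average green gain settles near `target_gain`, because the dual updates sum to the accumulated constraint violation: a bounded `lam` forces the average shortfall to shrink as the number of steps grows.
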