WaterPark: A Robustness Assessment of Language Model Watermarking

Jiacheng Liang, Zian Wang, Lauren Hong, Shouling Ji, Ting Wang
2024-11-21
Abstract: To mitigate the misuse of large language models (LLMs), such as disinformation, automated phishing, and academic cheating, there is a pressing need for the capability of identifying LLM-generated texts. Watermarking emerges as one promising solution: it plants statistical signals into LLMs' generative processes and subsequently verifies whether LLMs produce given texts. Various watermarking methods ("watermarkers") have been proposed; yet, due to the lack of unified evaluation platforms, many critical questions remain under-explored: i) What are the strengths/limitations of various watermarkers, especially their attack robustness? ii) How do various design choices impact their robustness? iii) How to optimally operate watermarkers in adversarial environments? To fill this gap, we systematize existing LLM watermarkers and watermark removal attacks, mapping out their design spaces. We then develop WaterPark, a unified platform that integrates 10 state-of-the-art watermarkers and 12 representative attacks. More importantly, leveraging WaterPark, we conduct a comprehensive assessment of existing watermarkers, unveiling the impact of various design choices on their attack robustness. For instance, a watermarker's resilience to increasingly intensive attacks hinges on its context dependency. We further explore the best practices to operate watermarkers in adversarial environments. For instance, using a generic detector alongside a watermark-specific detector improves the security of vulnerable watermarkers. We believe our study sheds light on current LLM watermarking techniques while WaterPark serves as a valuable testbed to facilitate future research.
Cryptography and Security, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the misuse of content generated by large language models (LLMs), such as disinformation, automated phishing, and academic cheating. To mitigate this misuse, it studies watermarking as a way to identify LLM-generated text. Specifically, it focuses on three questions:

1. **Strengths and limitations of different watermarking methods**: in particular, how robust are they against attacks?
2. **How do different design choices affect the robustness of watermarking methods?**
3. **How should watermarking methods be operated in adversarial environments?**

To answer these questions, the paper systematizes existing LLM watermarking methods and watermark-removal attacks and builds a unified evaluation platform, WaterPark, which integrates 10 state-of-the-art watermarking methods and 12 representative attacks. Using WaterPark, the authors carry out a comprehensive evaluation of existing watermarking methods, revealing how different design choices influence robustness. For example, a watermarking method's resistance to increasingly intense attacks depends on its context dependency. The paper also explores best practices for operating watermarking methods in adversarial environments. For example, combining a generic detector with a watermark-specific detector can improve the security of vulnerable watermarking methods. Overall, the paper aims to fill gaps in current research and to provide a test platform for future work.
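
The last best practice above, pairing a generic detector with a watermark-specific one, might be sketched as follows. This is only an illustration under stated assumptions: the summary does not say how the two detectors are combined, so the OR rule, the class name `CombinedDetector`, the thresholds, and both toy scoring functions are hypothetical and are not taken from WaterPark.

```python
from dataclasses import dataclass
from typing import Callable

# A detector maps a text to a score in [0, 1]; higher means "more likely LLM-generated".
Detector = Callable[[str], float]


@dataclass
class CombinedDetector:
    """Flags a text if EITHER detector is confident (assumed OR rule)."""

    watermark_detector: Detector
    generic_detector: Detector
    watermark_threshold: float = 0.5  # assumed value, not from the paper
    generic_threshold: float = 0.5    # assumed value, not from the paper

    def is_llm_generated(self, text: str) -> bool:
        wm_hit = self.watermark_detector(text) >= self.watermark_threshold
        generic_hit = self.generic_detector(text) >= self.generic_threshold
        return wm_hit or generic_hit


# Toy stand-ins for real detectors, used only to make the sketch runnable.
def toy_watermark_score(text: str) -> float:
    # Pretend the watermark survives only when the marker "[wm]" is present.
    return 0.9 if "[wm]" in text else 0.1


def toy_generic_score(text: str) -> float:
    # Pretend a generic detector keys on a stereotypical LLM phrase.
    return 0.8 if "as an ai" in text.lower() else 0.2


detector = CombinedDetector(toy_watermark_score, toy_generic_score)
```

Under this assumed rule, a text whose watermark has been removed can still be flagged if the generic detector fires. The trade-off is that an OR rule also accumulates the false positives of both detectors, so the thresholds would need to be tuned jointly.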