A Certified Robust Watermark For Large Language Models

Xianheng Feng, Jian Liu, Kui Ren, Chun Chen
2024-09-29
Abstract: The effectiveness of watermark algorithms in AI-generated text identification has garnered significant attention. Concurrently, an increasing number of watermark algorithms have been proposed to enhance robustness against various watermark attacks. However, these watermark algorithms remain susceptible to adaptive or unseen attacks. To address this issue, we propose what is, to the best of our knowledge, the first certified robust watermark algorithm for large language models based on randomized smoothing, which can provide provable guarantees for watermarked text. Specifically, we use two different models for watermark generation and detection, respectively, and add Gaussian and uniform noise in the embedding and permutation spaces, respectively, during the training and inference stages of the watermark detector, to enhance its certified robustness and derive a certified radius. To evaluate the empirical and certified robustness of our watermark algorithm, we conducted comprehensive experiments. The results indicate that our watermark algorithm performs comparably to baseline algorithms while achieving substantial certified robustness, which means that our watermark cannot be removed even under significant alterations.
Cryptography and Security
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the vulnerability of watermarking algorithms for text generated by large language models (LLMs) to watermark attacks. Existing watermarking algorithms are robust against some specific attacks, but they remain vulnerable to adaptive or unseen attacks. The authors therefore propose the first certified robust watermarking algorithm based on randomized smoothing, which provides provable robustness guarantees for watermarked text.

### Specific problem description

1. **Limitations of existing watermarking algorithms**:
   - Existing algorithms perform well under the specific attacks they were designed for, but offer no guarantee against adaptive or unseen attacks.
   - For example, the watermarking framework proposed by Kirchenbauer et al. performs poorly under attacks such as synonym replacement.
   - The semantically invariant robust watermarking scheme proposed by Liu et al. requires the user's input prompt, which is impractical in real-world scenarios.
   - The fixed green token list method proposed by Zhao et al. offers weak resistance to forgery.
2. **Requirement for certified robustness**:
   - Provable robustness guarantees are an effective way to deal with unseen attacks.
   - In image and text classification, prior work has already provided such guarantees through randomized smoothing.
   - However, no certified robust watermarking algorithm based on randomized smoothing has so far been applied to large language models.

### Solutions

1. **Certified robust watermarking algorithm based on randomized smoothing**:
   - The authors propose the first certified robust watermarking algorithm based on randomized smoothing.
   - The algorithm adds Gaussian noise in the embedding space and uniform noise in the permutation space to enhance the certified robustness of the watermark detector.
   - Because the noise is added during both the training and inference stages of the detector, the algorithm can provide provable robustness guarantees and derive a certified radius.
2. **Experimental verification**:
   - The authors conduct extensive experiments to evaluate the empirical and certified robustness of the algorithm under different watermark attacks.
   - The results show that the algorithm performs comparably to the baseline algorithms under various watermark attacks while achieving substantial certified robustness: the watermark is difficult to remove even after the text has been significantly modified.

### Main contributions

1. **The first certified robust watermarking algorithm**:
   - The algorithm is the first certified robust watermarking algorithm based on randomized smoothing, and it provides provable robustness guarantees for text generated by large language models.
2. **Randomized smoothing in two spaces**:
   - Adding Gaussian noise in the embedding space and uniform noise in the permutation space enhances the robustness of the watermark.
3. **Extensive experimental verification**:
   - A large number of experiments verify the performance of the algorithm under various watermark attacks and demonstrate its advantages in certified robustness.

### Conclusion

By proposing a certified robust watermarking algorithm based on randomized smoothing, the paper addresses the vulnerability of existing watermarking algorithms to unseen attacks and offers a new approach to watermark detection in text generated by large language models.
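
To make the randomized-smoothing idea above more concrete, here is a minimal sketch of Gaussian smoothing in an embedding space. It is not the authors' implementation. `toy_detector`, `smoothed_detect`, and `certified_radius` are hypothetical names, and the detector is a stand-in rule rather than a trained model. The radius uses the standard binary-case bound from the randomized-smoothing literature for classifiers, R = sigma * Phi^{-1}(p), which may differ from the bound derived in the paper. The sketch also uses the empirical vote fraction where a rigorous certificate would use a statistical lower confidence bound, and it leaves out the uniform noise in the permutation space.

```python
import random
from statistics import NormalDist


def toy_detector(embedding):
    """Hypothetical stand-in for a watermark detector.

    Returns 1 ("watermarked") when the mean coordinate is positive, else 0.
    A real detector would be a trained model; this rule is only illustrative.
    """
    return 1 if sum(embedding) / len(embedding) > 0 else 0


def smoothed_detect(embedding, sigma=0.5, n_samples=1000, seed=0):
    """Majority vote of the base detector under Gaussian noise.

    Returns the predicted label and the empirical fraction of votes
    that the winning label received.
    """
    rng = random.Random(seed)
    votes_for_one = 0
    for _ in range(n_samples):
        # Perturb every embedding coordinate with N(0, sigma^2) noise.
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        votes_for_one += toy_detector(noisy)
    p_one = votes_for_one / n_samples
    label = 1 if p_one >= 0.5 else 0
    return label, max(p_one, 1.0 - p_one)


def certified_radius(p_top, sigma):
    """L2 radius sigma * Phi^{-1}(p_top) for the binary case.

    Assumption: p_top is used as-is; a rigorous certificate would replace
    it with a lower confidence bound (for example Clopper-Pearson).
    """
    if p_top <= 0.5:
        return 0.0  # no majority, so nothing can be certified
    # Clamp below 1 because inv_cdf is undefined at exactly 1.
    p = min(p_top, 1.0 - 1e-6)
    return sigma * NormalDist().inv_cdf(p)


if __name__ == "__main__":
    embedding = [1.0] * 16  # made-up embedding of a "watermarked" text
    label, p_top = smoothed_detect(embedding, sigma=0.5)
    print(label, p_top, certified_radius(p_top, sigma=0.5))
```

Under these assumptions, any perturbation of the embedding with L2 norm below the returned radius leaves the smoothed decision unchanged. A larger `sigma` can certify a larger radius but usually costs accuracy on clean inputs, which is why the detector is also trained with noise.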