Abstract: Watermark has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detector against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In this work, we propose a new transfer evasion attack to image watermark in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show that, both theoretically and empirically, watermark-based AI-generated image detector is not robust to evasion attacks even if the attacker does not have access to the watermarking model nor the detection API.
What problem does this paper attempt to address?
This paper addresses the robustness of watermark-based detection against evasion attacks in the no-box setting. Specifically, the researchers study how a transfer attack can remove the watermark from an image, and thereby evade detection of AI-generated images, when the attacker has access to neither the target watermarking model nor the detection API.
### Background of the Paper and Problem Definition
As generative AI advances, synthesized images are becoming increasingly realistic, which threatens the authenticity of information on the Internet. To distinguish AI-generated content from non-AI-generated content, industry has widely adopted watermarking. For example, Google embeds SynthID watermarks in images generated by Imagen; OpenAI embeds watermarks in images generated by DALL-E; and Stable Diffusion lets users embed watermarks in generated images.
However, attackers can use evasion attacks to remove the watermark from an image and evade detection. Such an attack adds a small perturbation to the watermarked image so that the target watermark-based detector misclassifies the perturbed image as non-AI-generated. The literature has a good understanding of the robustness of watermark-based detection in the white-box setting (the attacker has access to the target watermarking model) and the black-box setting (the attacker has access to the detection API), but robustness in the no-box setting, where the attacker has access to neither, is much less studied.
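To make the detection step concrete, here is a minimal sketch of how a watermark-based detector might reach its decision. It assumes the common scheme in which the detector decodes a bit string from the image and flags the image as AI-generated when the bitwise accuracy against the reference watermark reaches a threshold. The function names, the 8-bit watermark, and the threshold `tau=0.75` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def bitwise_accuracy(decoded, reference):
    """Fraction of decoded bits that match the reference watermark."""
    decoded = np.asarray(decoded)
    reference = np.asarray(reference)
    return float(np.mean(decoded == reference))

def is_detected(decoded, reference, tau=0.75):
    """Flag the image as AI-generated when enough bits match.

    tau is a hypothetical threshold chosen for illustration only.
    """
    return bitwise_accuracy(decoded, reference) >= tau

# Hypothetical 8-bit reference watermark.
reference = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Bits decoded from an unperturbed watermarked image: all match.
decoded_clean = reference.copy()

# Bits decoded after an evasion attack flipped the first four bits.
decoded_perturbed = reference.copy()
decoded_perturbed[:4] ^= 1

print(is_detected(decoded_clean, reference))      # all bits match
print(is_detected(decoded_perturbed, reference))  # half the bits match
```

Under this assumed scheme, an evasion attack succeeds once the perturbation flips enough decoded bits to push the bitwise accuracy below the threshold.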
### Main Contributions of the Paper
1. **Proposing a new transfer-based attack method**:
- In the no-box setting, the method trains multiple surrogate watermarking models and crafts a perturbation that evades all of them; the perturbed image then also evades the target watermarking model.
- Unlike existing classifier-based transfer attacks and transfer attacks that use a single surrogate watermarking model, this method attacks multiple surrogate watermarking models directly, which makes it better suited to watermark-based detection.
2. **Theoretically analyzing the effectiveness of the attack**:
- The researchers quantify the correlation between the target watermarking model and the surrogate watermarking models, and derive the probability that a bit of the watermark decoded by the target model is flipped once the perturbation is added.
- From this probability they derive upper and lower bounds on the probability that the watermark decoded by the target model matches the ground-truth watermark, which quantifies the transferability of the attack.
3. **Empirically evaluating the effect of the attack**:
- The method was evaluated on image datasets generated by Stable Diffusion and Midjourney. The results show that the attack still evades watermark-based detection while maintaining image quality, even when the surrogate watermarking models differ from the target model in watermarking algorithm, neural network architecture, and watermark length, and are trained on data from a different distribution.
- The experiments also show that the method significantly outperforms common post-processing methods, existing transfer attacks, and recent adversarial-example purification methods.
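The idea behind the multi-surrogate attack in contribution 1 can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the decoders are random linear maps rather than trained neural networks, the surrogates and the target are built as noisy copies of a shared `base` matrix so that they are correlated, each surrogate's target watermark is simply the complement of the bits it currently decodes, and the perturbation is found by projected signed-gradient descent on a hinge loss under a fixed budget `eps`. The paper's own optimization and models are not reproduced.

```python
import numpy as np

def decode(W, image):
    """Toy linear decoder (hypothetical): one bit per row of W."""
    return (W @ (image - 0.5) > 0).astype(int)

def bitwise_accuracy(a, b):
    return float(np.mean(a == b))

def transfer_attack(image, surrogates, eps=0.05, step=0.005,
                    n_steps=200, margin=0.1):
    """Find one perturbation that flips the bits decoded by every surrogate."""
    # Desired logit sign per bit: decoded 1 -> push negative, 0 -> push positive.
    signs = [1 - 2 * decode(W, image) for W in surrogates]
    delta = np.zeros_like(image)
    for _ in range(n_steps):
        grad = np.zeros_like(image)
        for W, t in zip(surrogates, signs):
            logits = W @ (image + delta - 0.5)
            active = t * logits < margin      # bits not yet flipped with margin
            grad -= W.T @ (t * active)        # hinge-loss gradient w.r.t. delta
        # Signed-gradient step, then project onto the L-infinity ball.
        delta = np.clip(delta - step * np.sign(grad), -eps, eps)
        # Keep the perturbed image inside the valid pixel range [0, 1].
        delta = np.clip(image + delta, 0.0, 1.0) - image
    return image + delta

rng = np.random.default_rng(0)
d, n_bits, sigma = 64 * 64, 8, 0.5
base = rng.standard_normal((n_bits, d))

def make_decoder():
    """Noisy copy of the shared base, so decoders are correlated (assumption)."""
    W = base + sigma * rng.standard_normal((n_bits, d))
    return W / np.sqrt((1 + sigma**2) * d)

surrogates = [make_decoder() for _ in range(3)]  # trained by the attacker
target = make_decoder()                          # never seen by the attack
image = rng.uniform(0.0, 1.0, size=d)            # stand-in watermarked image
true_bits = decode(target, image)                # treated as the embedded watermark

adv = transfer_attack(image, surrogates)
surrogate_acc = float(np.mean(
    [bitwise_accuracy(decode(W, adv), decode(W, image)) for W in surrogates]))
target_acc = bitwise_accuracy(decode(target, adv), true_bits)
print("surrogate bitwise accuracy after attack:", surrogate_acc)
print("target bitwise accuracy after attack:", target_acc)
```

The attack never touches `target`; it only sees the surrogates. How far the target's bitwise accuracy drops in this toy depends on `sigma`, that is, on how correlated the target is with the surrogates, which is the quantity the theoretical analysis in contribution 2 is built around.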
### Conclusion
This paper shows that, in the no-box setting, a transfer-based attack can effectively evade watermark-based detection, which poses new challenges for existing watermarking techniques. The researchers propose a new attack method, analyze its effectiveness theoretically, and verify its performance experimentally. The results are a useful reference for future work on watermarking techniques and adversarial attacks.