Fake It Until You Break It: On the Adversarial Robustness of AI-generated Image Detectors

Sina Mavali, Jonas Ricker, David Pape, Yash Sharma, Asja Fischer, Lea Schönherr
2024-10-03
Abstract: While generative AI (GenAI) offers countless possibilities for creative and productive tasks, artificially generated media can be misused for fraud, manipulation, scams, misinformation campaigns, and more. To mitigate the risks associated with maliciously generated media, forensic classifiers are employed to identify AI-generated content. However, current forensic classifiers are often not evaluated in practically relevant scenarios, such as the presence of an attacker or when real-world artifacts like social media degradations affect images. In this paper, we evaluate state-of-the-art AI-generated image (AIGI) detectors under different attack scenarios. We demonstrate that forensic classifiers can be effectively attacked in realistic settings, even when the attacker does not have access to the target model and post-processing occurs after the adversarial examples are created, which is standard on social media platforms. These attacks can significantly reduce detection accuracy to the extent that the risks of relying on detectors outweigh their benefits. Finally, we propose a simple defense mechanism to make CLIP-based detectors, which are currently the best-performing detectors, robust against these attacks.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the robustness of AI-generated image (AIGI) detectors against real-world attacks. It focuses on how current detectors perform when an attacker is present, especially when the attacker lacks specific information about the target model and the images have been post-processed by social media platforms. The paper points out that although existing AIGI detectors perform well under ideal conditions, they show significant deficiencies in robustness and reliability when confronted with maliciously generated media in the real world.

### Main Research Questions:

1. **Robustness Evaluation of Detectors**: The paper evaluates the robustness of state-of-the-art AIGI detectors under different attack scenarios, including white-box attacks (where the attacker has complete information about the target model) and black-box attacks (where the attacker uses surrogate models to generate adversarial examples).
2. **Effectiveness of Real-World Attacks**: The study investigates how effective attacks remain after images have undergone post-processing operations (such as compression, blurring, and added noise) that are common on social media platforms.
3. **Proposing Defense Mechanisms**: A simple defense mechanism is proposed to enhance the robustness of CLIP-based detectors, allowing them to maintain high detection accuracy under various attacks.

### Research Background:

- **Development of Generative Models**: The paper discusses recent advances in generative models such as diffusion models (DMs), which can generate highly realistic images that humans find difficult to distinguish from real ones.
- **Detection Methods**: Current detection methods fall into three main categories: methods using high-level features, methods based on low-level features, and data-driven methods. Among them, data-driven methods based on pre-trained models (such as CLIP) show excellent generalization ability and robustness.
- **Adversarial Examples**: Adversarial examples are images to which small perturbations have been added so that the detector misclassifies them. The paper details several common adversarial attack methods, such as FGSM, BIM, and PGD, and explores their effects under different norm constraints.

### Experimental Setup:

- **Datasets**: Two datasets were used: GenImage for training the detectors and Synthbuster for testing and evaluation.
- **Detectors**: Three of the latest AIGI detectors (Grag, UnivFD, and DRCT) were selected, and some models were retrained on images generated by Stable Diffusion 1.4.
- **Attack Methods**: Various adversarial attack methods (BIM, PGD, FGSM) were used, and experiments were conducted under L2 and L∞ norms.
- **Evaluation Metrics**: Accuracy, AUC ROC, and TPR@5%FPR were used to evaluate the performance of the detectors and the effectiveness of adversarial attacks.

### Main Findings:

- **Baseline Performance**: In benign conditions without attacks, CLIP-based detectors (such as UnivFD* and DRCT-CLIP) significantly outperformed CNN-based detectors in terms of generalization ability.
- **Effectiveness of Adversarial Attacks**: Even after post-processing, simple adversarial attacks can significantly reduce the accuracy of detectors, in some cases down to 0%.
- **Effectiveness of Defense Mechanisms**: The proposed defense mechanism can effectively enhance the robustness of detectors under both white-box and black-box attack scenarios, maintaining high detection accuracy even under high perturbation conditions.

### Conclusion:

The paper reveals the vulnerability of current AIGI detectors to real-world attacks and proposes effective defense measures. These findings are of great significance for improving the reliability and security of detectors in real-world applications.
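
To make the attack methods and the TPR@5%FPR metric named in the summary more concrete, here is a minimal NumPy sketch. It is not the paper's code or one of its detectors: the "detector" is a hypothetical fixed linear model on 64 toy "pixels", and every name in it (`score`, `fgsm`, `pgd_linf`, `tpr_at_fpr`, `make_images`) is an illustrative assumption. It shows a single-step FGSM attack, an iterative PGD attack under an L∞ budget `eps` (BIM is commonly described as the same loop without the random start), and how TPR at 5% FPR could be computed from detector scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                        # toy "image" with 64 pixels in [0, 1]
w = rng.normal(size=d)        # hypothetical fixed linear detector weights
b = -0.5 * w.sum()            # a mid-gray image gets logit 0

def score(x, w, b):
    """Toy detector: probability that each row of x is AI-generated."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def loss_grad(x, y, w, b):
    """Gradient of binary cross-entropy with respect to the input x."""
    return (score(x, w, b) - y)[:, None] * w[None, :]

def fgsm(x, y, w, b, eps):
    """Single-step L-inf attack: one signed-gradient step of size eps."""
    return np.clip(x + eps * np.sign(loss_grad(x, y, w, b)), 0.0, 1.0)

def pgd_linf(x, y, w, b, eps, step, iters, rng):
    """Iterative L-inf attack: random start, then repeated signed-gradient
    steps, each followed by projection back onto the eps-ball around x."""
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(loss_grad(x_adv, y, w, b))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay within the budget
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay a valid image
    return x_adv

def tpr_at_fpr(scores_real, scores_fake, fpr=0.05):
    """Share of fakes flagged at the threshold that flags ~fpr of reals."""
    threshold = np.quantile(scores_real, 1.0 - fpr)
    return float(np.mean(scores_fake > threshold))

def make_images(n, label):
    """Synthetic data: fakes are shifted slightly along sign(w), reals against."""
    shift = 0.03 * np.sign(w) * (1 if label == 1 else -1)
    return np.clip(0.5 + shift + 0.05 * rng.normal(size=(n, d)), 0.0, 1.0)

x_real = make_images(500, 0)
x_fake = make_images(500, 1)
y_fake = np.ones(len(x_fake))

eps = 0.1
x_fgsm = fgsm(x_fake, y_fake, w, b, eps)
x_pgd = pgd_linf(x_fake, y_fake, w, b, eps, step=eps / 4, iters=10, rng=rng)

for name, x in [("clean", x_fake), ("FGSM", x_fgsm), ("PGD", x_pgd)]:
    acc = np.mean(score(x, w, b) > 0.5)
    tpr = tpr_at_fpr(score(x_real, w, b), score(x, w, b))
    print(f"{name}: accuracy on fakes = {acc:.2f}, TPR@5%FPR = {tpr:.2f}")
```

Because the toy detector is linear, the loss gradient has the same sign everywhere, so one step already finds the worst case and the iterative attack adds little. Against the deep detectors the paper evaluates, iterative attacks are generally expected to be stronger. The sketch also covers only the white-box case and does not model surrogate models or post-processing.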