Semi-Truths: A Large-Scale Dataset of AI-Augmented Images for Evaluating Robustness of AI-Generated Image detectors

Anisha Pal, Julia Kruk, Mansi Phute, Manognya Bhattaram, Diyi Yang, Duen Horng Chau, Judy Hoffman
2024-11-12
Abstract: Text-to-image diffusion models have impactful applications in art, design, and entertainment, yet these technologies also pose significant risks by enabling the creation and dissemination of misinformation. Although recent advancements have produced AI-generated image detectors that claim robustness against various augmentations, their true effectiveness remains uncertain. Do these detectors reliably identify images with different levels of augmentation? Are they biased toward specific scenes or data distributions? To investigate, we introduce SEMI-TRUTHS, featuring 27,600 real images, 223,400 masks, and 1,472,700 AI-augmented images that feature targeted and localized perturbations produced using diverse augmentation techniques, diffusion models, and data distributions. Each augmented image is accompanied by metadata for standardized and targeted evaluation of detector robustness. Our findings suggest that state-of-the-art detectors exhibit varying sensitivities to the types and degrees of perturbations, data distributions, and augmentation methods used, offering new insights into their performance and limitations. The code for the augmentation and evaluation pipeline is available at https://github.com/J-Kruk/SemiTruths.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper examines how effective and robust existing AI-generated image detectors are when faced with images that have been augmented to varying degrees and by different methods. Specifically:

1. **Detector Effectiveness**: Can existing AI-generated image detectors reliably identify images that have been augmented to different extents?
2. **Detector Bias**: Are these detectors biased toward specific scenes or data distributions?

To investigate these questions, the authors introduce a large-scale dataset named SEMI-TRUTHS, which includes 27,600 real images, 223,400 masks, and 1,472,700 AI-augmented images. The augmented images feature targeted and localized perturbations and are produced using a variety of augmentation techniques, diffusion models, and data distributions. Each augmented image is accompanied by metadata that supports standardized, targeted evaluation of detector robustness.

### Main Contributions

1. **Large-Scale Dataset**: The SEMI-TRUTHS dataset provides diverse image augmentations that vary in the size of the edited region (Area Ratio) and in the degree of semantic change (Semantic Magnitude), each categorized into small, medium, and large levels.
2. **Detailed Metadata**: Each augmented image comes with detailed metadata, including the original data distribution, the augmentation method, the diffusion model, and more.
3. **Flexible Framework**: A flexible, pluggable framework is provided that supports unsupervised image editing and can be adapted to new data distributions, large language models, diffusion models, and image synthesis techniques.
4. **Performance Evaluation**: By evaluating six state-of-the-art AI-generated image detectors, the paper reveals their sensitivity to different data distributions, diffusion models, and perturbation levels.

### Experimental Results

The experimental results show that existing detectors vary in sensitivity across different types of augmented images.
For example:

- **CrossEfficientViT** shows significant performance degradation on images from scene-centric benchmarks like ADE20K, CityScapes, and SUN-RGBD.
- **DE-FAKE** performs worst on face-centric images from datasets like CelebA-HQ and HumanParsing.
- **UniversalFakeDetect** experiences performance drops in complex and multi-instance scenarios.

These results show that detectors are highly sensitive to the semantic properties of the data distribution, which underscores the importance of stress testing to identify and address distribution-specific weaknesses.
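
The metadata-driven evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Record` fields, the `bin_change` thresholds (0.25 and 0.6), and the toy records are all assumptions made for illustration. The sketch computes an area ratio from a binary mask, bins it into small, medium, and large, and reports the fraction of augmented images a detector flags within each metadata group.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated image. Field names are hypothetical, not the dataset schema."""
    dataset: str          # source data distribution, e.g. "ADE20K"
    area_ratio: float     # fraction of pixels inside the edit mask (0.0 to 1.0)
    is_fake: bool         # ground truth: True if the image is AI-augmented
    predicted_fake: bool  # detector decision


def area_ratio(mask):
    """Return the fraction of nonzero pixels in a 2D binary mask (list of rows)."""
    total = sum(len(row) for row in mask)
    edited = sum(1 for row in mask for px in row if px)
    return edited / total if total else 0.0


def bin_change(value, small_max=0.25, large_min=0.6):
    """Bin a magnitude into small/medium/large. Thresholds are illustrative only."""
    if value < small_max:
        return "small"
    if value < large_min:
        return "medium"
    return "large"


def recall_by_group(records, key):
    """Fraction of augmented images flagged as fake, grouped by key(record)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if not r.is_fake:
            continue  # recall is computed over augmented images only
        group = key(r)
        totals[group] += 1
        hits[group] += int(r.predicted_fake)
    return {group: hits[group] / totals[group] for group in totals}


# Toy records with made-up values; they do not reflect results from the paper.
TOY_RECORDS = [
    Record("ADE20K", 0.10, True, False),
    Record("ADE20K", 0.70, True, True),
    Record("CelebA-HQ", 0.30, True, True),
    Record("CelebA-HQ", 0.00, False, False),
]

if __name__ == "__main__":
    print(recall_by_group(TOY_RECORDS, lambda r: r.dataset))
    print(recall_by_group(TOY_RECORDS, lambda r: bin_change(r.area_ratio)))
```

Grouping by a `key` function keeps the same routine usable for any metadata field (dataset, diffusion model, augmentation method, or magnitude bin), which is the kind of stratified comparison the per-image metadata is meant to enable.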