Abstract: We establish rigorous benchmarks for visual perception robustness. Synthetic images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific types of evaluation over synthetic corruptions, backgrounds, and textures, yet those robustness benchmarks are restricted to specified variations and have low synthetic quality. In this work, we introduce generative models as a data source for synthesizing hard images that benchmark deep models' robustness. Leveraging diffusion models, we are able to generate images with more diversified backgrounds, textures, and materials than any prior work; we term this benchmark ImageNet-D. Experimental results show that ImageNet-D causes a significant accuracy drop for a range of vision models, from the standard ResNet visual classifier to the latest foundation models like CLIP and MiniGPT-4, reducing their accuracy by up to 60%. Our work suggests that diffusion models can be an effective source to test vision models. The code and dataset are available at

What problem does this paper attempt to address?

This paper addresses how to evaluate the robustness of neural networks in visual perception. Existing robustness benchmarks such as ImageNet-C, ImageNet-9, and Stylized-ImageNet evaluate specific kinds of synthetic corruption, background, and texture, but they cover only a fixed set of variations and their synthetic images are of low quality. The paper therefore proposes a new benchmark, ImageNet-D, which uses diffusion models to generate high-quality synthetic images with diverse backgrounds, textures, and materials for evaluating deep models.

### Main contributions

1. **Generate high-quality synthetic images**: Diffusion models are used to generate high-quality synthetic images with diverse backgrounds, textures, and materials. These images are more challenging than those in existing benchmark datasets.
2. **Evaluate model robustness**: Experiments show that ImageNet-D significantly reduces the accuracy of a range of vision models, including standard ResNet classifiers and recent foundation models such as CLIP and MiniGPT-4, with accuracy dropping by up to 60%.
3. **Propose the concept of shared perception failure**: Images on which multiple proxy models all fail are identified as shared perception failures. These failures transfer reliably to other, untested models, which further supports the effectiveness of ImageNet-D.
4. **Human-annotated quality control**: Manual annotation ensures that the generated images are valid, single-category, and high-quality, making ImageNet-D suitable for evaluating the robustness of different neural networks.

### Experimental results

- **Quantitative results**: ImageNet-D significantly reduces the accuracy of all tested models. In particular, for recent large-scale foundation models such as LLaVA and MiniGPT-4, accuracy drops by 29.67% and 16.81% respectively.
- **Visualization results**: Although humans can easily identify the main object, models such as CLIP (ViT-L/14) and MiniGPT-4 clearly misclassify ImageNet-D images.
- **Data augmentation and model architecture**: Existing data augmentation methods improve robustness on ImageNet-C, but their gains on ImageNet-D are unsatisfactory, indicating that ImageNet-D is a necessary addition to robustness evaluation.

In conclusion, by introducing ImageNet-D, this paper provides a more comprehensive and challenging benchmark for evaluating the robustness of neural networks in visual perception tasks.
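
The shared perception failure idea from the contributions list can be illustrated with a small sketch. It assumes the selection rule is "keep an image only if every proxy model predicts the wrong label", and that predictions are already available as integer class indices. The function name `shared_failure_indices` and the model names `model_a` and `model_b` are hypothetical and used for illustration only; this is not the authors' pipeline.

```python
from typing import Dict, List, Sequence


def shared_failure_indices(
    labels: Sequence[int],
    proxy_predictions: Dict[str, Sequence[int]],
) -> List[int]:
    """Return indices of images that every proxy model misclassifies.

    Assumption: a "shared failure" means all proxy models are wrong on
    the image; the source summary does not spell out the exact rule.
    """
    if not proxy_predictions:
        raise ValueError("need at least one proxy model")
    for name, preds in proxy_predictions.items():
        if len(preds) != len(labels):
            raise ValueError(f"{name}: expected {len(labels)} predictions")
    return [
        i
        for i, label in enumerate(labels)
        if all(preds[i] != label for preds in proxy_predictions.values())
    ]


# Toy data: five generated images with ground-truth class indices.
labels = [0, 1, 2, 3, 4]
proxy_predictions = {
    "model_a": [0, 2, 1, 3, 0],  # hypothetical proxy model outputs
    "model_b": [1, 2, 2, 0, 3],
}

hard_images = shared_failure_indices(labels, proxy_predictions)
print(hard_images)  # [1, 4]: only images 1 and 4 fool both models
```

Images that fool only one proxy model (indices 0, 2, and 3 in the toy data) are dropped, which is what makes the remaining set plausible as a test for models that were not used in the selection.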

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
