ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object

Chenshuang Zhang, Fei Pan, Junmo Kim, In So Kweon, Chengzhi Mao
2024-03-28
Abstract: We establish rigorous benchmarks for visual perception robustness. Synthetic images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific types of evaluation over synthetic corruptions, backgrounds, and textures, yet those robustness benchmarks are restricted to specified variations and have low synthetic quality. In this work, we introduce generative models as a data source for synthesizing hard images that benchmark deep models' robustness. Leveraging diffusion models, we are able to generate images with more diversified backgrounds, textures, and materials than any prior work; we term this benchmark ImageNet-D. Experimental results show that ImageNet-D causes a significant accuracy drop across a range of vision models, from the standard ResNet visual classifier to the latest foundation models like CLIP and MiniGPT-4, reducing their accuracy by up to 60%. Our work suggests that diffusion models can be an effective source to test vision models. The code and dataset are available at
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of evaluating the robustness of neural networks in visual perception. Existing robustness benchmarks such as ImageNet-C, ImageNet-9, and Stylized-ImageNet evaluate specific types of synthetic corruption, background, and texture, but they are limited to those predefined variations and their synthetic images are of low quality. The paper therefore proposes a new benchmark dataset, ImageNet-D, which evaluates the robustness of deep models using high-quality synthetic images that diffusion models generate with diverse backgrounds, textures, and materials.

### Main contributions

1. **Generate high-quality synthetic images**: Use diffusion models to generate high-quality synthetic images with diverse backgrounds, textures, and materials. These images are more challenging than those in existing benchmark datasets.
2. **Evaluate model robustness**: Experiments show that ImageNet-D significantly reduces the accuracy of a range of vision models, including standard ResNet classifiers and recent foundation models such as CLIP and MiniGPT-4, with a reduction of up to 60%.
3. **Propose the concept of shared perception failure**: Identify images on which multiple surrogate models fail. These failures transfer reliably to other, previously untested models, which further supports the effectiveness of ImageNet-D.
4. **Human-annotated quality control**: Manual annotation ensures that the generated images are valid, single-class, and high-quality, making ImageNet-D suitable for evaluating the robustness of different neural networks.

### Experimental results

- **Quantitative results**: ImageNet-D significantly reduces the accuracy of all tested models. In particular, for recent large-scale foundation models such as LLaVA and MiniGPT-4, the accuracy drops by 29.67% and 16.81% respectively.
- **Visualization results**: Although humans can easily identify the main object in each image, models such as CLIP (ViT-L/14) and MiniGPT-4 clearly misclassify many ImageNet-D images.
- **Data augmentation and model architecture**: Existing data augmentation methods improve robustness on ImageNet-C, but they do not yield comparable gains on ImageNet-D, which indicates that ImageNet-D is a necessary robustness evaluation benchmark.

In conclusion, by introducing ImageNet-D, this paper provides a more comprehensive and challenging benchmark dataset for evaluating the robustness of neural networks in visual perception tasks.
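
The shared-perception-failure idea from the main contributions can be illustrated with a minimal sketch. It assumes that an image counts as a shared failure when every surrogate model predicts the wrong label; the summary does not state the exact rule or the number of surrogate models, so that criterion, the function names `shared_failures` and `accuracy`, the model names, and the toy predictions below are all hypothetical and not the authors' implementation.

```python
from typing import Dict, List, Sequence


def shared_failures(
    labels: Sequence[int],
    surrogate_preds: Dict[str, Sequence[int]],
) -> List[int]:
    """Return indices of images that every surrogate model misclassifies.

    Assumption: "shared failure" means all surrogates are wrong on the image.
    """
    if not surrogate_preds:
        raise ValueError("need at least one surrogate model")
    for name, preds in surrogate_preds.items():
        if len(preds) != len(labels):
            raise ValueError(f"{name}: prediction count does not match labels")
    return [
        i
        for i, y in enumerate(labels)
        if all(preds[i] != y for preds in surrogate_preds.values())
    ]


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data: five generated images with ground-truth class ids.
labels = [0, 1, 2, 3, 4]
surrogate_preds = {
    "model_a": [0, 2, 2, 1, 0],
    "model_b": [0, 1, 0, 1, 3],
    "model_c": [1, 1, 2, 0, 2],
}

# Images 3 and 4 are the only ones that all three surrogates get wrong.
hard = shared_failures(labels, surrogate_preds)

# A held-out target model is then scored on the full pool and on the hard subset.
target_preds = [0, 1, 2, 0, 4]
full_acc = accuracy(labels, target_preds)
hard_acc = accuracy([labels[i] for i in hard], [target_preds[i] for i in hard])
```

With this toy data, `hard` is `[3, 4]`, and the target model's accuracy falls from 0.8 on the full pool to 0.5 on the selected subset, which is the kind of transfer the summary describes. A looser rule, such as requiring only a majority of surrogates to fail, would keep more images but would likely select easier ones.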