Abstract: Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under partial occlusion, i.e., conditions in which objects are partially hidden from the camera's view. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures such as Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and therefore inexpensive to label. Additionally, methods are rarely compared against each other, and many are compared only against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO uses real-world and artificially occluded images to test and benchmark leading methods' robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN-based models show improved recognition accuracy on occluded images compared to earlier CNN-based models, and that ViT-based models are more accurate than CNN-based models on occluded images, performing only modestly worse than humans. We also find that certain types of occlusion, including diffuse occlusion, where relevant objects are seen through "holes" in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models, especially those with CNN backbones, relative to humans.

Partial success in closing the gap between human and machine vision

Seeing eye-to-eye? A comparison of object recognition performance in humans and deep convolutional neural networks under image manipulation

Aligning Machine and Human Visual Representations across Abstraction Levels

A comparison between humans and AI at recognizing objects in unusual poses

Humans and deep networks largely agree on which kinds of variation make object recognition harder

Human Eyes-Inspired Recurrent Neural Networks Are More Robust Against Adversarial Noises

Brain inspired Robust Vision using Convolutional Neural Networks with Feedback

Are All Vision Models Created Equal? A Study of the Open-Loop to Closed-Loop Causality Gap

Do humans and machines have the same eyes? Human-machine perceptual differences on image classification

Performance-optimized deep neural networks are evolving into worse models of inferotemporal visual cortex

Computer Vision: History, the Rise of Deep Networks, and Future Vistas Panel on Perception and Cognition, MORS Meeting on Artificial Intelligence and Autonomy

Extreme Image Transformations Affect Humans and Machines Differently

Evaluating (and Improving) the Correspondence Between Deep Neural Networks and Human Representations

Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks?

Human perception in computer vision

The human visual system and CNNs can both support robust online translation tolerance following extreme displacements

Progress and limitations of deep networks to recognize objects in unusual poses

Dissociable Neural Representations of Adversarially Perturbed Images in Convolutional Neural Networks and the Human Brain

Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms