Abstract: A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding distortion robustness gap between humans and CNNs is closing, with the best models now exceeding human feedforward performance on most of the investigated OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data and evaluation code are provided as a toolbox and benchmark at: https://github.com/bethgelab/model-vs-human/
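To make the notion of image-level consistency in finding (2.) concrete, the sketch below computes a chance-corrected agreement score (Cohen's kappa) between two observers from their per-image correct/incorrect outcomes. This is an illustrative assumption, not necessarily the exact metric used in the benchmark: the abstract does not define the measure, and the function name `error_consistency` and its inputs are hypothetical.

```python
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Cohen's kappa over per-image correctness of two observers (assumed metric).

    correct_a[i] and correct_b[i] say whether observer A and observer B
    classified image i correctly. Returns a value in [-1, 1]: 0 means the
    overlap in errors is what two independent observers with these
    accuracies would show by chance; 1 means identical error patterns.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n

    # Observed agreement: fraction of images where both are right or both are wrong.
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n

    # Agreement expected by chance from the two accuracies alone.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if expected == 1.0:
        # Kappa is undefined when both observers are always right (or always
        # wrong); returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    human = [True, True, False, False]
    model = [True, False, True, False]
    print(error_consistency(human, model))
```

Correcting for chance matters because two observers with high accuracy agree on most images trivially; under this assumption, only agreement beyond that baseline indicates that they make errors on the same images.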

Do humans and machines have the same eyes? Human-machine perceptual differences on image classification

Seeing eye-to-eye? A comparison of object recognition performance in humans and deep convolutional neural networks under image manipulation

Inferring Human Vision in a Human-Like Way: Key Factors Influencing the Cognitive Processing of Level-1 Visual Perspective-Taking

Partial success in closing the gap between human and machine vision

Human Vs Machine Attention in Neural Networks: A Comparative Study

Understanding More about Human and Machine Attention in Deep Neural Networks

Human-Machine Plan Conflict and Conflict Resolution in a Visual Search Task

Perception of Visual Content: Differences Between Humans and Foundation Models

When Does Perceptual Alignment Benefit Vision Representations?

Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap

Do better ImageNet classifiers assess perceptual similarity better?

ColorSense: A Study on Color Vision in Machine Visual Recognition

Human perception in computer vision

A psychophysics approach for quantitative comparison of interpretable computer vision models

A comparison between humans and AI at recognizing objects in unusual poses

Measuring and modeling the perception of natural and unconstrained gaze in humans and machines

When will AI misclassify? Intuiting failures on natural images

Perception of Image Features in Post-Mortem Iris Recognition: Humans vs Machines

Comparing Human and Machine Bias in Face Recognition

Comparing Facial Expression Recognition in Humans and Machines: Using CAM, GradCAM, and Extremal Perturbation

Do humans and Convolutional Neural Networks attend to similar areas during scene classification: Effects of task and image type