Abstract: A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding distortion robustness gap between humans and CNNs is closing, with the best models now exceeding human feedforward performance on most of the investigated OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data and evaluation code are provided as a toolbox and benchmark at: https://github.com/bethgelab/model-vs-human/
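As an illustration of what an image-level consistency measure can look like, the sketch below computes Cohen's kappa over per-image correctness for two observers (for example, a human and a model). The abstract does not state which statistic is used, so the choice of kappa, the function name `error_consistency`, and the toy data are assumptions made for illustration only; this is not the benchmark's evaluation code.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa on per-image correctness of two observers.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    image, True where that observer classified the image correctly.
    Hypothetical helper; the metric choice is an assumption.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n

    # Fraction of images on which both observers are right or both are wrong.
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n

    # Agreement expected from two independent observers with these accuracies.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if expected == 1.0:
        # Kappa is undefined when chance agreement is already perfect;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


# Toy data: four images, three observers (illustrative values only).
human = [True, True, False, False]
model_same_errors = [True, True, False, False]
model_other_errors = [True, False, True, False]

kappa_same = error_consistency(human, model_same_errors)
kappa_other = error_consistency(human, model_other_errors)
```

Under this measure, two observers with identical accuracy can still score near zero if their errors fall on different images, which is the kind of gap finding (2.) describes.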

Computer Vision: History, the Rise of Deep Networks, and Future Vistas Panel on Perception and Cognition, MORS Meeting on Artificial Intelligence and Autonomy

CNN Features Off-the-Shelf: An Astounding Baseline for Recognition

Seeing eye-to-eye? A comparison of object recognition performance in humans and deep convolutional neural networks under image manipulation

Convolutional networks and applications in vision

Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition

A comparison between humans and AI at recognizing objects in unusual poses

Towards flexible perception with visual memory

Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition

Progress and limitations of deep networks to recognize objects in unusual poses

Wider or Deeper: Revisiting the ResNet Model for Visual Recognition

Deep Learning Using Isotroping, Laplacing, Eigenvalues Interpolative Binding, and Convolved Determinants with Normed Mapping for Large-Scale Image Retrieval

Partial success in closing the gap between human and machine vision

Deep Nets: What have they ever done for Vision?

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

Revisiting Sparse Convolutional Model for Visual Recognition

Connectivity-Inspired Network for Context-Aware Recognition

Data-Efficient Image Recognition with Contrastive Predictive Coding

Rethinking the Inception Architecture for Computer Vision

Brain inspired Robust Vision using Convolutional Neural Networks with Feedback