Abstract: View-invariant object recognition is a challenging problem that has attracted much attention in the psychology, neuroscience, and computer vision communities. Humans are notoriously good at it, even if some variations are presumably more difficult to handle than others (e.g., 3D rotations). Humans are thought to solve the problem through hierarchical processing along the ventral stream, which progressively extracts more and more invariant visual features. This feed-forward architecture has inspired a new generation of bio-inspired computer vision systems called deep convolutional neural networks (DCNNs), which are currently the best algorithms for object recognition in natural images. Here, for the first time, we systematically compared human feed-forward vision and DCNNs at view-invariant object recognition using the same images and controlling for both the kinds of transformation and their magnitude. We used four object categories, and images were rendered from 3D computer models. In total, 89 human subjects participated in 10 experiments in which they had to discriminate between two or four categories after rapid presentation with backward masking. We also tested two recent DCNNs on the same tasks. We found that humans and DCNNs largely agreed on the relative difficulties of each kind of variation: rotation in depth is by far the hardest transformation to handle, followed by scale, then rotation in plane, and finally position. This suggests that humans recognize objects mainly through 2D template matching, rather than by constructing 3D object models, and that DCNNs are not too unreasonable models of human feed-forward vision. Also, our results show that the variation levels in rotation in depth and scale strongly modulate both humans' and DCNNs' recognition performance. We thus argue that these variations should be controlled in the image datasets used in vision research.

Shape-selective processing in deep networks: integrating the evidence on perceptual integration

Teaching deep networks to see shape: Lessons from a simplified visual world

Visual Complexity of Shapes: a Hierarchical Perceptual Learning Model

Mixed Evidence for Gestalt Grouping in Deep Neural Networks

Seeing eye-to-eye? A comparison of object recognition performance in humans and deep convolutional neural networks under image manipulation

Do DNNs trained on Natural Images acquire Gestalt Properties?

Shape-Based Measures Improve Scene Categorization

Disentangling neural mechanisms for perceptual grouping

On the Influence of Shape, Texture and Color for Learning Semantic Segmentation

Approaching human 3D shape perception with neurally mappable models

Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study

Shape-Biased Learning by Thinking Inside the Box

Learning a model of shape selectivity in V4 cells reveals shape encoding mechanisms in the brain

Saliency Suppressed, Semantics Surfaced: Visual Transformations in Neural Networks and the Brain

Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity

Learning compact generalizable neural representations supporting perceptual grouping

Understanding Deep Convolutional Networks through Gestalt Theory

Brain-like emergent properties in deep networks: impact of network architecture, datasets and training

ShapeGlot: Learning Language for Shape Differentiation

High-level aftereffects reveal the role of statistical features in visual shape encoding

Humans and deep networks largely agree on which kinds of variation make object recognition harder