Abstract: Deep neural networks (DNNs) are promising models of the cortical computations supporting human object recognition. However, despite their ability to explain a significant portion of variance in neural data, the agreement between models and brain representational dynamics is far from perfect. We address this issue by asking which representational features are currently unaccounted for in neural time series data, estimated for multiple areas of the ventral stream via source-reconstructed magnetoencephalography data acquired in human participants (nine females, six males) during object viewing. We focus on the ability of visuo-semantic models, consisting of human-generated labels of object features and categories, to explain variance beyond the explanatory power of DNNs alone. We report a gradual reversal in the relative importance of DNN versus visuo-semantic features as ventral-stream object representations unfold over space and time. Although lower-level visual areas are better explained by DNN features starting early in time (at 66 ms after stimulus onset), higher-level cortical dynamics are best accounted for by visuo-semantic features starting later in time (at 146 ms after stimulus onset). Among the visuo-semantic features, object parts and basic categories drive the advantage over DNNs. These results show that a significant component of the variance unexplained by DNNs in higher-level cortical dynamics is structured and can be explained by readily nameable aspects of the objects. We conclude that current DNNs fail to fully capture dynamic representations in higher-level human visual cortex and suggest a path toward more accurate models of ventral-stream computations.

SIGNIFICANCE STATEMENT: When we view objects such as faces and cars in our visual environment, their neural representations dynamically unfold over time at a millisecond scale. These dynamics reflect the cortical computations that support fast and robust object recognition. DNNs have emerged as a promising framework for modeling these computations but cannot yet fully account for the neural dynamics. Using magnetoencephalography data acquired in human observers during object viewing, we show that readily nameable aspects of objects, such as 'eye', 'wheel', and 'face', can account for variance in the neural dynamics over and above DNNs. These findings suggest that DNNs and humans may in part rely on different object features for visual recognition and provide guidelines for model improvement.
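
The central comparison above, whether one set of predictors explains variance "over and above" another, can be illustrated with a nested-regression sketch. This is not the authors' analysis pipeline: the abstract does not specify the fitting procedure, so the example assumes ordinary least squares, in-sample R², and purely synthetic data. The names `dnn`, `sem`, and `unique_variance` are hypothetical stand-ins for DNN features, visuo-semantic labels, and the quantity of interest.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)


def unique_variance(X_a, X_b, y):
    """Variance in y explained by X_a beyond what X_b already explains."""
    full = r_squared(np.column_stack([X_a, X_b]), y)
    reduced = r_squared(X_b, y)
    return full - reduced


# Synthetic stand-ins (assumptions, not real data): 3 "DNN" predictors,
# 2 "visuo-semantic" predictors, and a response driven more by the latter.
rng = np.random.default_rng(0)
n = 500
dnn = rng.normal(size=(n, 3))
sem = rng.normal(size=(n, 2))
y = 1.0 * dnn[:, 0] + 2.0 * sem[:, 0] + rng.normal(size=n)

unique_sem = unique_variance(sem, dnn, y)  # semantic beyond DNN
unique_dnn = unique_variance(dnn, sem, y)  # DNN beyond semantic
print(f"unique visuo-semantic variance: {unique_sem:.3f}")
print(f"unique DNN variance:            {unique_dnn:.3f}")
```

In this toy setting the visuo-semantic predictors carry more unique variance because the synthetic response was built that way. An analysis of real neural time series would repeat such a comparison per cortical area and time point, and would typically use cross-validation rather than in-sample fits.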

Using deep neural networks to disentangle visual and semantic information in human perception and memory

Deep learning algorithms reveal a new visual-semantic representation of familiar faces in human perception and memory

Disentangled deep generative models reveal coding principles of the human face processing network

Seeing eye-to-eye? A comparison of object recognition performance in humans and deep convolutional neural networks under image manipulation

Unsupervised deep learning identifies semantic disentanglement in single inferotemporal neurons

Bridging the Semantic Latent Space Between Brain and Machine: Similarity is All You Need

Integrated deep visual and semantic attractor neural networks predict fMRI pattern-information along the ventral object processing pathway

Multimodal deep neural decoding reveals highly resolved spatiotemporal profile of visual object representation in humans

Towards flexible perception with visual memory

Using drawings and deep neural networks to characterize the building blocks of human visual similarity

Text-related functionality of visual human pre-frontal activations revealed through neural network convergence

Concurrent emergence of view invariance, sensitivity to critical features, and identity face classification through visual experience: Insights from deep learning algorithms

A Novel Biologically Inspired Visual Cognition Model: Automatic Extraction of Semantics, Formation of Integrated Concepts, and Reselection Features for Ambiguity

Multi-Semantic Decoding of Visual Perception with Graph Neural Networks

Deep Neural Networks and Visuo-Semantic Models Explain Complementary Components of Human Ventral-Stream Representational Dynamics

Saliency Suppressed, Semantics Surfaced: Visual Transformations in Neural Networks and the Brain

Semantic Relatedness Emerges in Deep Convolutional Neural Networks Designed for Object Recognition

Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features

Human EEG and artificial neural networks reveal disentangled representations of object real-world size in natural images

Semantic Content in Face Representation: Essential for Proficient Recognition of Unfamiliar Faces by Good Recognizers

Visual Neural Decoding via Improved Visual-EEG Semantic Consistency