Abstract: The primate visual system achieves remarkable visual object recognition performance even in brief presentations and under changes to object exemplar, geometric transformations, and background variation (a.k.a. core visual object recognition). This performance is mediated by the representation formed in inferior temporal (IT) cortex. In parallel, recent advances in machine learning have led to ever higher performing models of object recognition using artificial deep neural networks (DNNs). It remains unclear, however, whether the representational performance of DNNs rivals that of the brain. A major difficulty in producing an accurate comparison has been the lack of a unifying metric that accounts for experimental limitations, such as the amount of noise, the number of neural recording sites, and the number of trials, and for computational limitations, such as the complexity of the decoding classifier and the number of classifier training examples. In this work we perform a direct comparison that corrects for these experimental limitations and computational considerations. As part of our methodology, we propose an extension of "kernel analysis" that measures the generalization accuracy as a function of representational complexity. Our evaluations show that, unlike previous bio-inspired models, the latest DNNs rival the representational performance of IT cortex on this visual object recognition task. Furthermore, we show that models that perform well on measures of representational performance also perform well on measures of representational similarity to IT and on measures of predicting individual IT multi-unit responses. Whether these DNNs rely on computational mechanisms similar to the primate visual system is yet to be determined, but, unlike all previous bio-inspired models, that possibility cannot be ruled out merely on representational performance grounds.
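
The abstract describes measuring generalization accuracy as a function of representational complexity but does not give the procedure. The following is a minimal sketch of one way such a curve could be computed, not the authors' implementation. It assumes kernel ridge regression with a Gaussian kernel, uses the regularization strength `lam` as the complexity control (smaller `lam` allows a more complex decoder), and uses the standard closed-form leave-one-out residual for regularized least squares. The function names `gaussian_kernel` and `loo_error_curve`, and the default `sigma`, are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kernel(X, sigma):
    """Gaussian (RBF) kernel matrix for the rows of X (n_samples x n_features)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def loo_error_curve(X, Y, lambdas, sigma=1.0):
    """Leave-one-out mean squared error of kernel ridge regression per lambda.

    X: (n_samples, n_features) representation (assumed: model features or
       neural responses, one row per image).
    Y: (n_samples, n_targets) regression targets, e.g. +/-1 class codes.
    Returns an array with one error value per entry of `lambdas`.
    """
    K = gaussian_kernel(X, sigma)
    # Eigendecompose once so (K + lam*I)^-1 is cheap to form for every lam.
    evals, evecs = np.linalg.eigh(K)
    evals = np.maximum(evals, 0.0)
    errors = []
    for lam in lambdas:
        G_inv = (evecs / (evals + lam)) @ evecs.T  # (K + lam*I)^-1
        coeffs = G_inv @ Y
        # Closed-form leave-one-out residual: c_i / (G^-1)_ii
        loo_residuals = coeffs / np.diag(G_inv)[:, None]
        errors.append(np.mean(loo_residuals ** 2))
    return np.array(errors)
```

Under these assumptions, plotting the returned errors against `1 / lam` gives an error-versus-complexity curve for a single representation; a representation whose curve reaches low error at low complexity would count as the better one in this sketch.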

CortexNet: a Generic Network Family for Robust Visual Temporal Representations

TEINet: Towards an Efficient Architecture for Video Recognition

Cortex Neural Network: learning with Neural Network groups

Connectivity-Inspired Network for Context-Aware Recognition

SynapNet: A Complementary Learning System Inspired Algorithm With Real-Time Application in Multimodal Perception

Beyond the Camera: Neural Networks in World Coordinates

Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

Clockwork Convnets for Video Semantic Segmentation

Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning

Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

A robust event-driven approach to always-on object recognition

Invariant Visual Object and Face Recognition: Neural and Computational Bases, and a Model, VisNet

Temporal-attentive Covariance Pooling Networks for Video Recognition

ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization

Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Spatio-Temporal Adaptation in the Unsupervised Development of Networked Visual Neurons

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

A Sparse Coding Multi-Scale Precise-Timing Machine Learning Algorithm for Neuromorphic Event-Based Sensors

EDeNN: Event Decay Neural Networks for low latency vision