Abstract: How do newborns learn to see? We propose that visual systems are space-time fitters, meaning visual development can be understood as a blind fitting process (akin to evolution) in which visual systems gradually adapt to the spatiotemporal data distributions in the newborn's environment. To test whether space-time fitting is a viable theory for learning how to see, we performed parallel controlled-rearing experiments on newborn chicks and deep neural networks (DNNs), including CNNs and transformers. First, we raised newborn chicks in impoverished environments containing a single object, then simulated those environments in a video game engine. Second, we recorded first-person images from agents moving through the virtual animal chambers and used those images to train DNNs. Third, we compared the viewpoint-invariant object recognition performance of the chicks and DNNs. When DNNs received the same visual diet (training data) as the chicks, the models developed the same object recognition skills as the chicks. DNNs that used time as a teaching signal (space-time fitters) also showed the same patterns of successes and failures across the test viewpoints as the chicks. Thus, DNNs can learn object recognition in the same impoverished environments as newborn animals. We argue that space-time fitters can serve as formal scientific models of newborn visual systems, providing image-computable models for studying how newborns learn to see from raw visual experiences.

Do machines learn like brains? The performance of all learning systems depends on both the learning machinery and the experiences (training data) from which the system learns, so answering this question requires giving machines and brains the same training data. To do so, we introduce a digital twin method for running parallel controlled-rearing studies on newborn animals and deep neural networks. We show that when deep neural networks (CNNs and transformers) are trained in the same visual environments as newborn chicks, the models develop the same object recognition skills as the chicks. Both newborn chicks and deep neural networks can learn invariant object representations that generalize across novel viewpoints, even when learning occurs in an impoverished environment containing a single object seen from a limited 60° viewpoint range. Our study shows that blind fitting processes (variation + selection learning) can mimic the rapid visual learning of precocial newborn animals, in the absence of innate (hardcoded) knowledge about objects or space. We argue that visual development can be understood as space-time fitting, in which visual systems gradually adapt to the spatiotemporal data distributions in the environment.
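The abstract does not say which objective the models used when they "used time as a teaching signal". One common way to do this is a time-contrastive loss, in which embeddings of frames that are close together in time are pulled together and embeddings of temporally distant frames are pushed apart. The following is a minimal sketch under that assumption, not the authors' implementation. The function names, the `window` and `temperature` parameters, and the toy 2D embeddings are all hypothetical.

```python
import math

def temporal_positive_pairs(num_frames, window=1):
    """Return (anchor, positive) index pairs for frames at most `window` steps apart."""
    pairs = []
    for i in range(num_frames):
        for j in range(i + 1, min(i + window + 1, num_frames)):
            pairs.append((i, j))
    return pairs

def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def time_contrastive_loss(embeddings, window=1, temperature=0.1):
    """Mean InfoNCE-style loss over temporally adjacent frame pairs.

    Assumption for illustration: frames inside the temporal window are
    positives, and frames outside it serve as negatives.
    """
    pairs = temporal_positive_pairs(len(embeddings), window)
    total = 0.0
    for anchor, positive in pairs:
        pos_logit = cosine(embeddings[anchor], embeddings[positive]) / temperature
        neg_logits = [
            cosine(embeddings[anchor], embeddings[k]) / temperature
            for k in range(len(embeddings))
            if abs(k - anchor) > window
        ]
        denominator = math.exp(pos_logit) + sum(math.exp(l) for l in neg_logits)
        total += math.log(denominator) - pos_logit
    return total / len(pairs)

# Toy usage: embeddings that drift smoothly over time, like views of a slowly rotating object.
smooth = [[math.cos(0.4 * t), math.sin(0.4 * t)] for t in range(8)]
# The same embeddings in a scrambled temporal order.
scrambled = [smooth[i] for i in (0, 4, 1, 5, 2, 6, 3, 7)]

print("smooth loss:   ", time_contrastive_loss(smooth))
print("scrambled loss:", time_contrastive_loss(scrambled))
```

In this toy setup the loss is lower when embeddings change smoothly over time than when the same embeddings are presented in scrambled order. That is the sense in which temporal continuity can act as a teaching signal without object labels.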

Embodied vision for learning object representations

A Computational Account Of Self-Supervised Visual Learning From Egocentric Object Play

Embodied Object Representation Learning and Recognition

Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play

Active Object Manipulation Facilitates Visual Object Learning: An Egocentric Vision Study

Parallel development of object recognition in newborn chicks and deep neural networks

Learning 3D object-centric representation through prediction

A Computational Model of Early Word Learning from the Infant's Point of View

Online Object Representations with Contrastive Learning

Learning Object Semantic Similarity with Self-Supervision

Learning task-agnostic representation via toddler-inspired learning

Playful Interactions for Representation Learning

Selective Visual Representations Improve Convergence and Generalization for Embodied AI

Visual Experience Acquisition Based On View Angle Estimation From 3d Monocular Image

Fast and robust visual object recognition in young children

Learning visual object models on a robot using context and appearance cues

Invariant Object Recognition in the Visual System with Novel Views of 3D Objects

Are Vision Transformers More Data Hungry Than Newborn Visual Systems?

Embodied Contrastive Learning with Geometric Consistency and Behavioral Awareness for Object Navigation

Learning Invariant Object Recognition in the Visual System with Continuous Transformations

Learning Invariant Object and Spatial View Representations in the Brain Using Slow Unsupervised Learning