Abstract: How do newborns learn to see? We propose that visual systems are space-time fitters: visual development can be understood as a blind fitting process (akin to evolution) in which visual systems gradually adapt to the spatiotemporal data distributions in the newborn's environment. To test whether space-time fitting is a viable theory of learning how to see, we performed parallel controlled-rearing experiments on newborn chicks and deep neural networks (DNNs), including CNNs and transformers. First, we raised newborn chicks in impoverished environments containing a single object, then simulated those environments in a video game engine. Second, we recorded first-person images from agents moving through the virtual animal chambers and used those images to train DNNs. Third, we compared the viewpoint-invariant object recognition performance of the chicks and DNNs. When DNNs received the same visual diet (training data) as chicks, the models developed the same object recognition skills as chicks. DNNs that used time as a teaching signal (space-time fitters) also showed patterns of successes and failures across the test viewpoints in common with chicks. Thus, DNNs can learn object recognition in the same impoverished environments as newborn animals. We argue that space-time fitters can serve as formal scientific models of newborn visual systems, providing image-computable models for studying how newborns learn to see from raw visual experiences.

Do machines learn like brains? The performance of every learning system depends on both its learning machinery and the experiences (training data) from which it learns, so answering this question requires giving machines and brains the same training data. To do so, we introduce a digital twin method for running parallel controlled-rearing studies on newborn animals and deep neural networks. We show that when deep neural networks (CNNs and transformers) are trained in the same visual environments as newborn chicks, the models develop the same object recognition skills as chicks. Both newborn chicks and deep neural networks can learn invariant object representations that generalize across novel viewpoints, even when learning occurs in an impoverished environment containing a single object seen from a limited 60° viewpoint range. Our study shows that blind fitting processes (variation + selection learning) can mimic the rapid visual learning of precocial newborn animals in the absence of innate (hardcoded) knowledge about objects or space. We argue that visual development can be understood as space-time fitting, in which visual systems gradually adapt to the spatiotemporal data distributions in the environment.
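
The abstract does not say which objective the models used, so the following is only a minimal sketch of one common way to use time as a teaching signal: an InfoNCE-style temporal contrastive loss, in which frames that occur close together in time are treated as views of the same thing. The function names, the `window` and `temperature` parameters, and the use of cosine similarity are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def time_contrastive_loss(embeddings, window=1, temperature=0.1):
    """Hypothetical temporal contrastive loss over a sequence of frame embeddings.

    `embeddings` is a list of vectors in temporal order. For each anchor
    frame t, frames at most `window` steps away are positives and every
    other frame is a negative. Returns the mean InfoNCE-style loss over
    all (anchor, positive) pairs.
    """
    n = len(embeddings)
    total, count = 0.0, 0
    for t in range(n):
        others = [j for j in range(n) if j != t]
        sims = [cosine(embeddings[t], embeddings[j]) / temperature for j in others]
        # Numerically stable log-sum-exp over all non-anchor frames.
        peak = max(sims)
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims))
        for j, s in zip(others, sims):
            if abs(j - t) <= window:
                total += log_denom - s  # -log softmax probability of the positive
                count += 1
    return total / count


# Embeddings that drift slowly over time versus the same vectors in scrambled order.
smooth = [[math.cos(0.3 * t), math.sin(0.3 * t)] for t in range(8)]
shuffled = [smooth[i] for i in [0, 4, 1, 5, 2, 6, 3, 7]]
print(time_contrastive_loss(smooth), time_contrastive_loss(shuffled))
```

Under this sketch, a sequence whose embeddings change slowly from frame to frame scores a lower loss than the same embeddings in scrambled order, which is the sense in which temporal adjacency acts as the supervisory signal. In a real model, the embeddings would come from a trainable encoder and this loss would be minimized by gradient descent.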

A Computational Model of Early Word Learning from the Infant's Point of View

A model of early word acquisition based on realistic-scale audiovisual naming events

A computational model of early language acquisition from audiovisual experiences of young infants

Computational Baby Learning

Predicting Word Learning in Children from the Performance of Computer Vision Systems

Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play

Embodied vision for learning object representations

Biologically Inspired Model for Visual Cognition Achieving Unsupervised Episodic and Semantic Feature Learning

Active Object Manipulation Facilitates Visual Object Learning: An Egocentric Vision Study

A Computational Account Of Self-Supervised Visual Learning From Egocentric Object Play

Towards Computational Baby Learning: A Weakly-Supervised Approach for Object Detection

Grounded language acquisition through the eyes and ears of a single child

Parallel development of object recognition in newborn chicks and deep neural networks

Understanding Early Word Learning in Situated Artificial Agents

Developmental Predictive Coding Model for Early Infancy Mono and Bilingual Vocal Continual Learning

A Semi-Automated Method for Object Segmentation in Infant's Egocentric Videos to Study Object Perception

Towards early prediction of neurodevelopmental disorders: Computational model for Face Touch and Self-adaptors in Infants

Moving beyond "nouns in the lab": Using naturalistic data to understand why infants' first words include uh-oh and hi

An Autonomous Developmental Cognitive Architecture Based on Incremental Associative Neural Network with Dynamic Audiovisual Fusion

Learning 3D object-centric representation through prediction

Spatial relation categorization in infants and deep neural networks