Abstract: How do newborns learn to see? We propose that visual systems are space-time fitters, meaning visual development can be understood as a blind fitting process (akin to evolution) in which visual systems gradually adapt to the spatiotemporal data distributions in the newborn's environment. To test whether space-time fitting is a viable theory for learning how to see, we performed parallel controlled-rearing experiments on newborn chicks and deep neural networks (DNNs), including CNNs and transformers. First, we raised newborn chicks in impoverished environments containing a single object, then simulated those environments in a video game engine. Second, we recorded first-person images from agents moving through the virtual animal chambers and used those images to train DNNs. Third, we compared the viewpoint-invariant object recognition performance of the chicks and DNNs. When DNNs received the same visual diet (training data) as chicks, the models developed the same object recognition skills as chicks. DNNs that used time as a teaching signal (space-time fitters) also showed the same patterns of successes and failures across the test viewpoints as chicks. Thus, DNNs can learn object recognition in the same impoverished environments as newborn animals. We argue that space-time fitters can serve as formal scientific models of newborn visual systems, providing image-computable models for studying how newborns learn to see from raw visual experiences.

Do machines learn like brains? The performance of all learning systems depends on both the learning machinery and the experiences (training data) from which the system learns, so answering this question requires giving machines and brains the same training data. To do so, we introduce a digital twin method for running parallel controlled-rearing studies on newborn animals and deep neural networks. We show that when deep neural networks (CNNs and transformers) are trained in the same visual environments as newborn chicks, the models develop the same object recognition skills as chicks. Both newborn chicks and deep neural networks can learn invariant object representations that generalize across novel viewpoints, even when learning occurs in an impoverished environment containing a single object seen from a limited 60° viewpoint range. Our study shows that blind fitting processes (variation + selection learning) can mimic the rapid visual learning of precocial newborn animals, in the absence of innate (hardcoded) knowledge about objects or space. We argue that visual development can be understood as space-time fitting, in which visual systems gradually adapt to the spatiotemporal data distributions in the environment.
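
The phrase "time as a teaching signal" can be made concrete with a small sketch. This section does not specify the training objective, so the code below is an assumption: an InfoNCE-style contrastive loss in which frames that occur close together in time are treated as views of the same thing. The function name `temporal_contrastive_loss` and the parameters `window` and `temperature` are hypothetical and chosen only for illustration.

```python
import numpy as np

def temporal_contrastive_loss(embeddings, window=1, temperature=0.1):
    """Illustrative InfoNCE-style loss over one video clip (an assumption,
    not the objective used in the study).

    embeddings: array of shape (T, D), one row per consecutive frame, T >= 2.
    Frames within `window` time steps of each other count as positives;
    all other frames in the clip act as negatives.
    """
    # Unit-normalise so the dot product is a cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a frame is never compared with itself

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positive mask: pairs of distinct frames at most `window` steps apart.
    t = np.arange(len(z))
    gap = np.abs(t[:, None] - t[None, :])
    positives = (gap >= 1) & (gap <= window)

    # Mean negative log-probability of the positives, per anchor frame.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_frame = -pos_log_prob / positives.sum(axis=1)
    return per_frame.mean()


# Toy usage: a slowly rotating 2-D embedding stands in for an encoder's
# output on a smooth first-person video of a single object.
angles = 0.1 * np.arange(20)
smooth_clip = np.stack([np.cos(angles), np.sin(angles)], axis=1)
shuffled_clip = smooth_clip[np.random.default_rng(0).permutation(20)]

print("loss, temporally smooth clip:", temporal_contrastive_loss(smooth_clip))
print("loss, shuffled clip:         ", temporal_contrastive_loss(shuffled_clip))
```

Under this assumed objective, the loss is lowest when neighbouring frames map to neighbouring embeddings. That is one way a model could be pushed toward viewpoint-tolerant representations of an object seen continuously over time, without any labels.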
