Abstract: It is often assumed that the brain builds 3D coordinate frames, in retinal coordinates (with binocular disparity giving the third dimension), head-centred, body-centred and world-centred coordinates. This paper questions that assumption and begins to sketch an alternative based on, essentially, a set of reflexes. A 'policy network' is a term used in reinforcement learning to describe the set of actions that are generated by an agent depending on its current state. This is an untypical starting point for describing 3D vision, but a policy network can serve as a useful representation both for the 3D layout of a scene and the location of the observer within it. It avoids 3D reconstruction of the type used in computer vision but is similar to recent representations for navigation generated through reinforcement learning. A policy network for saccades (pure rotations of the camera/eye) is a logical starting point for understanding (i) an ego-centric representation of space (e.g. Marr's 2½-D sketch; Marr 1982, Vision: a computational investigation into the human representation and processing of visual information) and (ii) a hierarchical, compositional representation for navigation. The potential neural implementation of policy networks is straightforward; a network with a large range of sensory and task-related inputs such as the cerebellum would be capable of implementing this input/output function. This is not the case for 3D coordinate transformations in the brain: no neurally implementable proposals have yet been put forward that could carry out a transformation of a visual scene from retinal to world-based coordinates. Hence, if the representation underlying 3D vision can be described as a policy network (in which the actions are either saccades or head translations), this would be a significant step towards a neurally plausible model of 3D vision.

Learning Internal Representations of 3D Transformations from 2D Projected Inputs

Embedding Visual Cognition in 3D Reconstruction from Multi-View Engineering Drawings

Approaching human 3D shape perception with neurally mappable models

Invariant Object Recognition in the Visual System with Novel Views of 3D Objects

LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies

3D computational modeling and perceptual analysis of kinetic depth effects

Transformation Properties of Learned Visual Representations

Learning intermediate-level representations of form and motion from natural movies

Complexity of mental geometry for 3D pose perception

Learning to Reconstruct 3D Structure from Object Motion

Learning Interpretable Dynamics from Images of a Freely Rotating 3D Rigid Body

A System View of the Recognition and Interpretation of Observed Human Shape, Pose and Action

Hierarchical 3D Perception from a Single Image

Neural Representations of Dynamic Visual Stimuli

Learning 3D object-centric representation through prediction

3D Interpreter Networks for Viewer-Centered Wireframe Modeling

Understanding 3D vision as a policy network

Unsupervised Learning of Visual 3D Keypoints for Control

Modelling Human Visual Motion Processing with Trainable Motion Energy Sensing and a Self-attention Network

Hierarchical motion perception as causal inference

Physically Plausible 3D Human-Scene Reconstruction from Monocular RGB Image using an Adversarial Learning Approach