Learning 3D object-centric representation through prediction

John Day, Tushar Arora, Jirui Liu, Li Erran Li, Ming Bo Cai

2024-03-06

Abstract: As part of human core knowledge, the representation of objects is the building block of mental representation that supports high-level concepts and symbolic reasoning. While humans develop the ability of perceiving objects situated in 3D environments without supervision, models that learn the same set of abilities with similar constraints faced by human infants are lacking. Towards this end, we developed a novel network architecture that simultaneously learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth, all while using only information directly available to the brain as training data, namely: sequences of images and self-motion. The core idea is treating objects as latent causes of visual input which the brain uses to make efficient predictions of future scenes. This results in object representations being learned as an essential byproduct of learning to predict.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

### The Problems This Paper Attempts to Solve

This paper aims to address the following issues:

1. **Unsupervised Learning of 3D Object Representation**: Humans can perceive objects in a three-dimensional environment without supervision, but there is currently a lack of models that can learn the same ability under similar conditions (i.e., without labeled data). The paper proposes a new network architecture that can simultaneously learn to segment objects, infer their 3D positions, and perceive depth using only image sequences and self-motion information.
2. **Efficient Prediction of Future Scenes**: Assuming that objects are the latent causes of visual input, the brain uses these representations to efficiently predict future scenes. Therefore, by learning to predict future images, object representations can be learned as a byproduct.
3. **Overcoming Limitations of Existing Models**: Most existing object-centric representation learning (OCRL) models can only solve part of the brain's perception tasks, and many models require pre-training or use additional information that the brain cannot directly obtain (such as depth, optical flow, and object bounding boxes). The proposed model OPPLE attempts to achieve object segmentation, depth perception, and 3D localization through unsupervised learning using only image sequences and self-motion information.

Through the above methods, this paper attempts to bridge the gap between current computer vision models and human perceptual abilities, especially in the area of unsupervised learning.
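To make concrete why self-motion is a useful training signal for prediction, the following is a minimal geometric sketch, not the OPPLE architecture itself, which is a learned network whose details are not given here. It assumes a static object, a pinhole camera, and self-motion limited to a translation plus a yaw rotation. The function names `predict_object_location` and `project`, the coordinate convention (x right, y up, z forward), and the `focal` parameter are all hypothetical choices made for illustration.

```python
import math


def predict_object_location(p, translation, yaw):
    """Predict where a static object appears in the camera frame after self-motion.

    p           -- (x, y, z) object location in the camera frame before moving
    translation -- (x, y, z) camera displacement, expressed in the old camera frame
    yaw         -- camera rotation about the vertical (y) axis, in radians

    Illustrative assumption: the object itself does not move, so its new
    camera-frame location is p' = R_y(yaw)^T (p - translation).
    """
    dx, dy, dz = (p[i] - translation[i] for i in range(3))
    c, s = math.cos(yaw), math.sin(yaw)
    # Apply the transpose of the yaw rotation matrix to the displaced point.
    return (c * dx - s * dz, dy, s * dx + c * dz)


def project(p, focal=1.0):
    """Pinhole projection of a camera-frame point onto the image plane.

    Returns None when the point is at or behind the camera plane (z <= 0).
    """
    x, y, z = p
    if z <= 0:
        return None
    return (focal * x / z, focal * y / z)


if __name__ == "__main__":
    # An object 5 units ahead; the camera steps 1 unit forward without turning.
    moved = predict_object_location((0.0, 0.0, 5.0), (0.0, 0.0, 1.0), 0.0)
    print(moved)           # the object is now 4 units ahead
    print(project(moved))  # and still projects to the image centre
```

Under these assumptions, an object's inferred 3D location together with the known self-motion determines where it should appear in the next frame. A mismatch between that prediction and the actual next image is the kind of error a prediction-based model can learn from without labels.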

Learning Object Spatial Relationship from Demonstration

Learning to Predict the 3D Layout of a Scene

Learning Physical Dynamics for Object-centric Visual Prediction

3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations

3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud

Embodied vision for learning object representations

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

Object-centric Video Representation for Long-term Action Anticipation

CoCoNets: Continuous Contrastive 3D Scene Representations

Object Pursuit: Building a Space of Objects Via Discriminative Weight Generation

Embodied Object Representation Learning and Recognition

Exploring Hierarchical Spatial Layout Cues for 3D Point Cloud based Scene Graph Prediction

Learning Temporal Cues by Predicting Objects Move for Multi-camera 3D Object Detection

Variational Inference for Scalable 3D Object-centric Learning

Learning Features by Watching Objects Move

Spatial Relation Learning in Complementary Scenarios with Deep Neural Networks

Hierarchical 3D Perception from a Single Image

Deep Predictive Learning in Neocortex and Pulvinar

Learning Object-Centric Representation via Reverse Hierarchy Guidance

Learning 3D Object Shape and Layout without 3D Supervision