Abstract: Visual scenes are extremely diverse, not only because there are infinite possible combinations of objects and backgrounds but also because the observations of the same scene may vary greatly with the change of viewpoints. When observing a multi-object visual scene from multiple viewpoints, humans can perceive the scene compositionally from each viewpoint while achieving the so-called "object constancy" across different viewpoints, even though the exact viewpoints are untold. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models that have a similar ability. In this paper, we consider a novel problem of learning compositional scene representations from multiple unspecified (i.e., unknown and unrelated) viewpoints without using any supervision and propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. During the inference, latent representations are randomly initialized and iteratively updated by integrating the information in different viewpoints with neural networks. Experiments on several specifically designed synthetic datasets have shown that the proposed method can effectively learn from multiple unspecified viewpoints.

What problem does this paper attempt to address?

The paper attempts to address the problem of unsupervised learning of compositional scene representations from multiple unspecified (i.e., unknown and unrelated) viewpoints. Specifically, the authors propose a new problem setting: learning object-centric representations from multiple unknown and unrelated viewpoints without any supervision (including viewpoint annotations) and achieving object constancy.

### Main Problems

1. **Object Constancy**: How to achieve object constancy, i.e., recognizing the same object from different viewpoints without guidance from viewpoint information.
2. **Representation Decoupling**: How to decouple image representations into object-centric representations (viewpoint-independent parts) and viewpoint-related representations, even when there are infinitely many possible solutions (e.g., due to changes in the global coordinate system).

### Background

- **Diversity of Visual Scenes**: Visual scenes are highly diverse, not only because of the infinite combinations of objects and backgrounds but also because the observation results of the same scene can vary greatly under different viewpoints.
- **Human Perceptual Ability**: Humans can perceive scenes from multiple viewpoints and achieve object constancy, even when the specific viewpoints are unknown. This ability is crucial for recognizing the same object in motion and efficiently learning visual information.

### Limitations of Existing Methods

- Most existing deep generative models can only learn compositional representations from a single viewpoint.
- Although some methods can handle multiple viewpoints, they usually require viewpoint annotations or assume temporal relationships between viewpoints, limiting their application in fully unsupervised scenarios.

### Proposed Method

- **OCLOC Model**: The authors propose a deep generative model named Object-Centric Learning with Object Constancy (OCLOC), which can learn compositional scene representations from multiple unknown and unrelated viewpoints without any supervision (including viewpoint annotations).
- **Model Structure**: OCLOC divides scene representations into viewpoint-independent parts (object-centric representations) and viewpoint-related parts. By iteratively updating the parameters of latent variables, the model can extract object-centric representations from information across different viewpoints.
- **Inference Method**: The amortized variational inference method is used, integrating information from different viewpoints into the approximate posterior distribution of latent variables through neural networks.

### Experimental Validation

- Experiments conducted on several specially designed synthetic datasets show that OCLOC can effectively learn from multiple unspecified viewpoints without supervision, performing comparably to or slightly better than existing methods.

### Summary

The paper proposes a new method to address the problem of unsupervised learning of compositional scene representations from multiple unspecified viewpoints, particularly achieving object constancy and representation decoupling, which is significant for understanding and modeling complex visual scenes.
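To make the idea of splitting latents into a viewpoint-independent part and a viewpoint-related part more concrete, here is a minimal toy sketch in plain Python. It is not the OCLOC implementation: the additive "renderer" in `observe`, the scalar latents, and the mean-based updates in `infer` are all assumptions made for illustration, standing in for the neural networks and amortized variational inference the paper describes. The only things it borrows from the summary are random initialization, iterative updating, and pooling information across viewpoints.

```python
import random


def observe(objects, views):
    # Hypothetical additive "renderer" (an assumption, not the paper's decoder):
    # the observation of object k from view v is object latent + view latent.
    return [[o + w for o in objects] for w in views]


def infer(obs, num_steps=5, seed=0):
    rng = random.Random(seed)
    num_views, num_objects = len(obs), len(obs[0])
    # Both latent groups start from random values.
    obj = [rng.gauss(0, 1) for _ in range(num_objects)]
    view = [rng.gauss(0, 1) for _ in range(num_views)]
    for _ in range(num_steps):
        # Viewpoint-independent part: pool evidence for each object over all views.
        obj = [sum(obs[v][k] - view[v] for v in range(num_views)) / num_views
               for k in range(num_objects)]
        # Viewpoint-related part: pool evidence for each view over all objects.
        view = [sum(obs[v][k] - obj[k] for k in range(num_objects)) / num_objects
                for v in range(num_views)]
    return obj, view


def reconstruction_error(obs, obj, view):
    # Largest absolute gap between the observations and their reconstruction.
    return max(abs(obs[v][k] - (obj[k] + view[v]))
               for v in range(len(view)) for k in range(len(obj)))


true_objects = [1.0, 4.0, -2.0]   # one scalar latent per object
true_views = [0.5, -1.5]          # one scalar latent per viewpoint
obs = observe(true_objects, true_views)
obj, view = infer(obs)
err = reconstruction_error(obs, obj, view)
print("object latents:", obj)
print("view latents:", view)
print("max reconstruction error:", err)
```

In this toy setting the observations are reconstructed essentially exactly, yet the recovered object latents generally differ from `true_objects` by a constant that is absorbed into the view latents. That shift is a small analogue of the ambiguity mentioned under Representation Decoupling: many decompositions explain the same images, so only relative quantities (such as the difference between two object latents) are determined by the data.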

Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints

Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

Improving Viewpoint-Independent Object-Centric Representations through Active Viewpoint Selection

Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation

Self-supervised Visual Reinforcement Learning with Object-centric Representations

Learning Unseen Concepts Via Hierarchical Decomposition and Composition

Unsupervised Discovery of Object-Centric Neural Fields

Compositional scene modeling with global object-centric representations

Generative Modeling of Infinite Occluded Objects for Compositional Scene Representation

Compositional Scene Representation Learning via Reconstruction: A Survey

Learning Global Object-Centric Representations via Disentangled Slot Attention

Neural View Synthesis and Matching for Semi-Supervised Few-Shot Learning of 3D Pose

Learning Generative Models of Scene Features

Learning to Infer Unseen Attribute-Object Compositions

Variational Inference for Scalable 3D Object-centric Learning

Unsupervised Joint 3D Object Model Learning and 6D Pose Estimation for Depth-Based Instance Segmentation

A Visual Navigation Perspective for Category-Level Object Pose Estimation