Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints

Jinyang Yuan,Tonglin Chen,Zhimeng Shen,Bin Li,Xiangyang Xue
2024-01-03
Abstract:Visual scenes are extremely diverse, not only because there are infinite possible combinations of objects and backgrounds but also because the observations of the same scene may vary greatly with the change of viewpoints. When observing a multi-object visual scene from multiple viewpoints, humans can perceive the scene compositionally from each viewpoint while achieving the so-called ``object constancy'' across different viewpoints, even though the exact viewpoints are untold. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models that have a similar ability. In this paper, we consider a novel problem of learning compositional scene representations from multiple unspecified (i.e., unknown and unrelated) viewpoints without using any supervision and propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. During the inference, latent representations are randomly initialized and iteratively updated by integrating the information in different viewpoints with neural networks. Experiments on several specifically designed synthetic datasets have shown that the proposed method can effectively learn from multiple unspecified viewpoints.
Computer Vision and Pattern Recognition,Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the problem of unsupervised learning of compositional scene representations from multiple unspecified (i.e., unknown and unrelated) viewpoints. Specifically, the authors propose a new problem setting: learning object-centric representations from multiple unknown and unrelated viewpoints without any supervision (including viewpoint annotations) and achieving object constancy. ### Main Problems 1. **Object Constancy**: How to achieve object constancy, i.e., recognizing the same object from different viewpoints without guidance from viewpoint information. 2. **Representation Decoupling**: How to decouple image representations into object-centric representations (viewpoint-independent parts) and viewpoint-related representations, even when there are infinitely many possible solutions (e.g., due to changes in the global coordinate system). ### Background - **Diversity of Visual Scenes**: Visual scenes are highly diverse, not only because of the infinite combinations of objects and backgrounds but also because the observation results of the same scene can vary greatly under different viewpoints. - **Human Perceptual Ability**: Humans can perceive scenes from multiple viewpoints and achieve object constancy, even when the specific viewpoints are unknown. This ability is crucial for recognizing the same object in motion and efficiently learning visual information. ### Limitations of Existing Methods - Most existing deep generative models can only learn compositional representations from a single viewpoint. - Although some methods can handle multiple viewpoints, they usually require viewpoint annotations or assume temporal relationships between viewpoints, limiting their application in fully unsupervised scenarios. ### Proposed Method - **OCLOC Model**: The authors propose a deep generative model named Object-Centric Learning with Object Constancy (OCLOC), which can learn compositional scene representations from multiple unknown and unrelated viewpoints without any supervision (including viewpoint annotations). - **Model Structure**: OCLOC divides scene representations into viewpoint-independent parts (object-centric representations) and viewpoint-related parts. By iteratively updating the parameters of latent variables, the model can extract object-centric representations from information across different viewpoints. - **Inference Method**: The amortized variational inference method is used, integrating information from different viewpoints into the approximate posterior distribution of latent variables through neural networks. ### Experimental Validation - Experiments conducted on several specially designed synthetic datasets show that OCLOC can effectively learn from multiple unspecified viewpoints without supervision, performing comparably to or slightly better than existing methods. ### Summary The paper proposes a new method to address the problem of unsupervised learning of compositional scene representations from multiple unspecified viewpoints, particularly achieving object constancy and representation decoupling, which is significant for understanding and modeling complex visual scenes.