SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
2024-09-24
Abstract: 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Robotics
What problem does this paper attempt to address?
The paper aims to address three major challenges in 3D vision-language (3D-VL) grounding:
1. **Scene Complexity**: 3D scenes are inherently complex, with diverse object configurations, rich attributes, and intricate relationships.
2. **Data Scarcity**: Compared to the 2D domain, paired 3D vision-language data is scarce, leaving too little to support grounded learning in 3D scenes.
3. **Lack of a Unified Learning Framework**: There is currently no unified learning framework for distilling knowledge from grounded 3D data.
To tackle these issues, the paper proposes SceneVerse, the first million-scale 3D vision-language dataset, containing about 68K indoor scenes and 2.5M vision-language pairs drawn from human annotations and a scalable scene-graph-based generation approach. It also introduces GPS (Grounded Pre-training for Scenes), a unified pre-training framework for 3D vision-language learning, and reports state-of-the-art performance on existing 3D visual grounding and question-answering benchmarks through extensive experiments. Furthermore, the paper shows that the benefit of scaling up data is not limited to GPS but also extends to other tasks such as 3D semantic segmentation.