SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

2024-09-24

Abstract: 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Robotics

What problem does this paper attempt to address?

The paper aims to address three major challenges in 3D vision-language (3D-VL) grounding:

1. **Scene Complexity**: 3D scenes are inherently complex because of their diverse object configurations, rich attributes, and intricate relationships.
2. **Data Scarcity**: Compared with the 2D domain, paired 3D vision-language data is scarce, so there is too little of it to support grounded learning in 3D scenes.
3. **Lack of a Unified Learning Framework**: There is currently no unified learning framework to distill knowledge from grounded 3D data.

To tackle these issues, the paper proposes SceneVerse, the first million-scale 3D vision-language dataset, containing about 68K indoor scenes and 2.5M vision-language pairs derived from human annotations and a scalable scene-graph-based generation approach. The paper also introduces a unified pre-training framework for 3D vision-language learning, Grounded Pre-training for Scenes (GPS), and reports state-of-the-art performance on existing 3D visual grounding and question-answering benchmarks through extensive experiments. Furthermore, the paper shows that the data-scaling effect is not limited to GPS but also benefits other tasks such as 3D semantic segmentation.
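The abstract mentions a scene-graph-based generation approach but does not spell out its mechanics. As a rough illustration only, the sketch below shows one simple way spatial relations in a scene graph could be turned into text with templates. The `Relation` class, the `TEMPLATES` table, and the `describe` function are all hypothetical names invented for this example; this is not the authors' pipeline, which may differ substantially.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One edge of a toy scene graph: subject --predicate--> target."""
    subject: str
    predicate: str
    target: str


# Hypothetical sentence templates, one per supported spatial predicate.
# These are assumptions for illustration, not templates from the paper.
TEMPLATES = {
    "on": "The {subject} is on the {target}.",
    "next to": "The {subject} is next to the {target}.",
    "under": "The {subject} is under the {target}.",
}


def describe(relations):
    """Render each relation with a known predicate as one sentence.

    Relations whose predicate has no template are skipped.
    """
    sentences = []
    for rel in relations:
        template = TEMPLATES.get(rel.predicate)
        if template is None:
            continue
        sentences.append(
            template.format(subject=rel.subject, target=rel.target)
        )
    return sentences


if __name__ == "__main__":
    scene_graph = [
        Relation("lamp", "on", "nightstand"),
        Relation("nightstand", "next to", "bed"),
        Relation("rug", "orbiting", "bed"),  # unsupported predicate, skipped
    ]
    for sentence in describe(scene_graph):
        print(sentence)
```

Because every sentence here is derived from an edge that names both objects, each generated description stays tied to specific scene objects, which is the property that makes such text usable as grounding supervision.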
