VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation

Haochen Zhang, Nader Zantout, Pujith Kachana, Zongyuan Wu, Ji Zhang, Wenshan Wang
2024-11-06
Abstract: With the recent rise of Large Language Models (LLMs), Vision-Language Models (VLMs), and other general foundation models, there is growing potential for multimodal, multi-task embodied agents that can operate in diverse environments given only natural language as input. One such application area is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the spatial reasoning and semantic understanding required, particularly in arbitrary scenes that may contain many objects belonging to fine-grained classes. To address this challenge, we curate the largest real-world dataset for Vision and Language-guided Action in 3D Scenes (VLA-3D), consisting of over 11.5K scanned 3D indoor rooms from existing datasets, 23.5M heuristically generated semantic relations between objects, and 9.7M synthetically generated referential statements. Our dataset consists of processed 3D point clouds, semantic object and room annotations, scene graphs, navigable free space annotations, and referential language statements that specifically focus on view-independent spatial relations for disambiguating objects. The goal of these features is to aid the downstream task of navigation, especially on real-world systems where some level of robustness must be guaranteed in an open world of changing scenes and imperfect language. We benchmark our dataset with current state-of-the-art models to obtain a performance baseline. All code to generate and visualize the dataset is publicly released at https://github.com/HaochenZ11/VLA-3D. With the release of this dataset, we hope to provide a resource for progress in semantic 3D scene understanding that is robust to changes and one which will aid the development of interactive indoor navigation systems.
Robotics
What problem does this paper attempt to address?
This paper addresses the challenges of indoor navigation in 3D scenes using natural language instructions. Specifically, it focuses on the following key issues:

1. **Spatial Reasoning and Semantic Understanding**: Spatial reasoning and semantic understanding are difficult in arbitrary scenes. Such scenes may contain hundreds of objects, many of which belong to fine-grained categories, and many of which closely resemble one another.
2. **Natural Language Processing**: Natural language instructions given by humans usually involve spatial relationships, functional properties, and open-vocabulary expressions. They may also be incorrect or refer to objects that are not where the speaker says they are (for example, "the remote control on the table" when the remote control is actually on the sofa). Handling such complex and variable language input is a major challenge.
3. **Data Scale**: Existing 3D vision-language datasets are far smaller than their 2D counterparts, which restricts the development of 3D vision-language learning methods. Although foundation models have made significant progress, current methods applied to robotics still cannot provide the accuracy and robustness required for practical deployment.

To address these problems, the paper proposes a new dataset named VLA-3D. The dataset is built from multiple existing datasets of scanned indoor environments and provides large-scale 3D point clouds, object-level attributes and semantic category labels, scene graphs, navigable free-space annotations, and natural language statements. These features are designed to support downstream tasks, especially navigation on real-world systems, where some level of robustness must be ensured under changing scenes and imperfect language input.
With this dataset, the authors hope to advance semantic understanding of 3D scenes and to provide a resource for developing interactive indoor navigation systems that can respond to commands, ask questions, and answer questions about scenes.
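
The abstract mentions referential statements built on view-independent spatial relations for disambiguating objects, but this page does not describe how they are generated. As a rough illustration only, the sketch below shows one way a "closest to" relation could single out an object among same-class distractors. The `SceneObject` record, the `closest_statement` function, the `margin` parameter, and the statement template are all hypothetical and are not taken from the VLA-3D code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical object record: id, class label, 3D centroid in meters."""
    obj_id: int
    label: str
    center: tuple


def closest_statement(target, anchor_label, objects, margin=0.2):
    """Return 'the <target> closest to the <anchor>' if that relation
    uniquely identifies `target` among objects of the same class.

    The relation depends only on centroid distances, so it does not
    change with the viewpoint of the observer. Returns None otherwise.
    """
    # The anchor must be unambiguous: exactly one object of that class.
    anchors = [o for o in objects if o.label == anchor_label]
    if len(anchors) != 1:
        return None
    anchor = anchors[0]

    # A relation is only needed when same-class distractors exist.
    candidates = [o for o in objects if o.label == target.label]
    if len(candidates) < 2:
        return None

    # Rank same-class objects by distance to the anchor.
    ranked = sorted(candidates, key=lambda o: math.dist(o.center, anchor.center))
    if ranked[0].obj_id != target.obj_id:
        return None

    # Assumed tie-breaking rule: require a clear gap to the runner-up.
    gap = (math.dist(ranked[1].center, anchor.center)
           - math.dist(ranked[0].center, anchor.center))
    if gap < margin:
        return None

    return f"the {target.label} closest to the {anchor_label}"


scene = [
    SceneObject(0, "table", (0.0, 0.0, 0.0)),
    SceneObject(1, "chair", (1.0, 0.0, 0.0)),
    SceneObject(2, "chair", (3.0, 0.0, 0.0)),
]
print(closest_statement(scene[1], "table", scene))
```

In this toy scene the function produces a statement for the nearer chair and returns `None` for the farther one, since "closest to the table" would not refer to it. The margin check is one possible design choice for rejecting near ties that a human listener could not reliably resolve.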