Abstract: With the recent rise of Large Language Models (LLMs), Vision-Language Models (VLMs), and other general foundation models, there is growing potential for multimodal, multi-task embodied agents that can operate in diverse environments given only natural language as input. One such application area is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the spatial reasoning and semantic understanding required, particularly in arbitrary scenes that may contain many objects belonging to fine-grained classes. To address this challenge, we curate the largest real-world dataset for Vision and Language-guided Action in 3D Scenes (VLA-3D), consisting of over 11.5K scanned 3D indoor rooms from existing datasets, 23.5M heuristically generated semantic relations between objects, and 9.7M synthetically generated referential statements. Our dataset consists of processed 3D point clouds, semantic object and room annotations, scene graphs, navigable free space annotations, and referential language statements that specifically focus on view-independent spatial relations for disambiguating objects. The goal of these features is to aid the downstream task of navigation, especially on real-world systems where some level of robustness must be guaranteed in an open world of changing scenes and imperfect language. We benchmark our dataset with current state-of-the-art models to obtain a performance baseline. All code to generate and visualize the dataset is publicly released at https://github.com/HaochenZ11/VLA-3D. With the release of this dataset, we hope to provide a resource for progress in semantic 3D scene understanding that is robust to changes and one which will aid the development of interactive indoor navigation systems.

VEnvision3D: A Synthetic Perception Dataset for 3D Multi-Task Model Research

Benchmarking Large-Scale Multi-View 3D Reconstruction Using Realistic Synthetic Images

360+x: A Panoptic Multi-modal Scene Understanding Dataset

VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation

HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras

VCVW-3D: A Virtual Construction Vehicles and Workers Dataset with 3D Annotations

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

UniVision: A Unified Framework for Vision-Centric 3D Perception

Multi-Modal Dataset Acquisition for Photometrically Challenging Object

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

A Real 3D Embodied Dataset for Robotic Active Visual Learning

ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation

Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation

EnvoDat: A Large-Scale Multisensory Dataset for Robotic Spatial Awareness and Semantic Reasoning in Heterogeneous Environments

3D Concept Learning and Reasoning from Multi-View Images

An Immersive Multi-Elevation Multi-Seasonal Dataset for 3D Reconstruction and Visualization

Large-Scale Indoor Visual-Geometric Multimodal Dataset and Benchmark for Novel View Synthesis

EDVAM: a 3D eye-tracking dataset for visual attention modeling in a virtual museum

PanDepth: Joint Panoptic Segmentation and Depth Completion

Delving into Multi-illumination Monocular Depth Estimation: A New Dataset and Method