Abstract: With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception has attracted growing attention because of its connection to the physical world, and it is progressing rapidly. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations. It is constructed with a top-down logic, from region level to object level and from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline uses powerful VLMs with carefully designed prompts to initialize the annotations efficiently, and then involves human correction in the loop to ensure that the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding models and LLMs, and obtain remarkable performance improvements both on existing benchmarks and in in-the-wild evaluation. Code, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.

ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes

MultiScan: Scalable RGBD scanning for 3D environments with articulated objects

Matterport3D: Learning from RGB-D Data in Indoor Environments

GenScan: A Generative Method for Populating Parametric 3D Scan Datasets

RevealNet: Seeing Behind Objects in RGB-D Scans

SSR-2D: Semantic 3D Scene Reconstruction from 2D Images

Robust 3D Reconstruction with an RGB-D Camera

3D Scene Reconstruction with Sparse LiDAR Data and Monocular Image in Single Frame

Scan2CAD: Learning CAD Model Alignment in RGB-D Scans

ARKitScenes - A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

CAD-Estate: Large-scale CAD Model Annotation in RGB Videos

ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes

SG-NN: Sparse Generative Neural Networks for Self-Supervised Scene Completion of RGB-D Scans

RGBDS-SLAM: A RGB-D Semantic Dense SLAM Based on 3D Multi Level Pyramid Gaussian Splatting

Online Scene Semantic Understanding Based on Sparsely Correlated Network for AR

Developing a Comprehensive 3D Point Cloud Dataset for Construction Projects

Automatic Semantic Modeling of Indoor Scenes from Low-Quality RGB-D Data Using Contextual Information