Abstract: Embodied tasks require the agent to fully understand 3D scenes while it explores them, so an online, real-time, fine-grained and highly generalizable 3D perception model is urgently needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFMs) have revolutionized the field of 2D computer vision with superior performance, which makes using VFMs to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow to be applied in practical embodied tasks. In this paper, we aim to leverage the Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem: future frames are not available in the streaming RGB-D input video, and an instance may be observed in several frames, so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module that represents the 2D masks generated by SAM as 3D-aware queries, which are then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefiting from the query representation of 3D masks, we can compute the similarity matrix between the 3D masks from different views with efficient matrix operations, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show that our method achieves leading performance, even compared with offline methods. Our method also demonstrates strong generalization in several zero-shot dataset transfer experiments and shows great potential in open-vocabulary and data-efficient settings.
Code and demo are available at <a class="link-external link-https" href="https://xuxw98.github.io/ESAM/" rel="external noopener nofollow">this https URL</a>, with only one RTX 3090 GPU required for training and evaluation.
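The cross-view matching step described in the abstract can be illustrated with a small sketch. It assumes each 3D mask is summarized by one query vector, uses cosine similarity as the score, and resolves matches greedily with a threshold; these choices, along with the names `similarity_matrix`, `match_instances` and the toy vectors, are assumptions made for illustration and not the procedure stated in the paper. Plain Python lists are used so the sketch runs without dependencies; a real-time system would compute the same matrix as a single batched tensor product.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(prev_queries, curr_queries):
    """Cosine similarity between all query pairs (rows: previous, cols: current)."""
    prev_n = [l2_normalize(q) for q in prev_queries]
    curr_n = [l2_normalize(q) for q in curr_queries]
    return [[sum(a * b for a, b in zip(p, c)) for c in curr_n] for p in prev_n]


def match_instances(sim, threshold=0.5):
    """Greedy one-to-one matching, highest similarity first.

    Returns {current_index: previous_index}. Current masks that are
    missing from the result would be registered as new instances.
    """
    pairs = sorted(
        ((sim[i][j], i, j) for i in range(len(sim)) for j in range(len(sim[i]))),
        reverse=True,
    )
    used_prev, assignment = set(), {}
    for score, i, j in pairs:
        if score < threshold:
            break  # all remaining pairs are weaker than the threshold
        if i in used_prev or j in assignment:
            continue  # each mask can be matched at most once
        used_prev.add(i)
        assignment[j] = i
    return assignment


# Hypothetical query vectors: two known instances, three masks in the new frame.
prev_queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
curr_queries = [[0.1, 0.9, 0.0], [0.0, 0.0, 1.0], [0.9, 0.1, 0.0]]

sim = similarity_matrix(prev_queries, curr_queries)
assignment = match_instances(sim)
print(assignment)
```

In this toy input, current masks 0 and 2 are merged into previous instances 1 and 0 respectively, while current mask 1 has no match above the threshold and would start a new instance.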

SAI3D: Segment Any Instance in 3D Scenes

SA3DIP: Segment Any 3D Instance with Potential 3D Priors

SAM3D: Segment Anything in 3D Scenes

SAMPart3D: Segment Any Part in 3D Objects

SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation

EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Segment Anything in 3D with Radiance Fields

Segment Anything in 3D with NeRFs.

SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding

SAM-guided Graph Cut for 3D Instance Segmentation

Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

SAM3D: Zero-Shot 3D Object Detection Via the Segment Anything Model

Point-SAM: Promptable 3D Segmentation Model for Point Clouds

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model

Segment anything model 2: an application to 2D and 3D medical images

Open-set Hierarchical Semantic Segmentation for 3D Scene

Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

Evaluation Study on SAM 2 for Class-agnostic Instance-level Segmentation

UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes