Abstract: Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference and high computation requirements. This cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models such as Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. This hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only add redundancy that unnecessarily increases the inference time. We empirically find that text prompts can be matched to 3D masks both more accurately and faster with a 2D object detector. We validate Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup compared to the best existing method in the literature. On the ScanNet200 validation set, Open-YOLO 3D achieves a mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at http://github.com/aminebdj/OpenYOLO3D.
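
The idea of labeling a class-agnostic 3D mask from 2D detections can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `project_points` and `label_instance`, the pinhole camera model, and the simple point-count voting rule are all assumptions made for illustration, and the sketch ignores occlusion and image bounds.

```python
import numpy as np
from collections import Counter

def project_points(points_world, K, world_to_cam):
    """Project Nx3 world points to pixels with a pinhole model (assumed).

    Returns (uv, in_front): Nx2 pixel coordinates and a boolean mask
    marking points that lie in front of the camera.
    """
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6
    z = np.where(in_front, cam[:, 2], 1.0)  # placeholder depth avoids division by zero
    uv = (K @ cam.T).T[:, :2] / z[:, None]
    return uv, in_front

def label_instance(points_world, views):
    """Vote a label for one 3D instance mask (hypothetical voting rule).

    views: iterable of (K, world_to_cam, detections), where detections is
    a list of ((x0, y0, x1, y1), label) boxes from any 2D object detector.
    Each projected point that falls inside a box adds one vote to its label.
    """
    votes = Counter()
    for K, world_to_cam, detections in views:
        uv, valid = project_points(points_world, K, world_to_cam)
        uv = uv[valid]
        for (x0, y0, x1, y1), label in detections:
            inside = (
                (uv[:, 0] >= x0) & (uv[:, 0] <= x1)
                & (uv[:, 1] >= y0) & (uv[:, 1] <= y1)
            )
            votes[label] += int(inside.sum())
    if not votes or max(votes.values()) == 0:
        return None  # no detection covers the projected mask
    return votes.most_common(1)[0][0]

# Toy usage: one view, identity camera pose, two detection boxes.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0], [0.1, 0.0, 2.0], [0.0, 0.1, 2.0]])
detections = [((40, 40, 60, 60), "chair"), ((0, 0, 10, 10), "table")]
print(label_instance(points, [(K, np.eye(4), detections)]))
```

In this toy setup the three points project near the image center, so only the first box collects votes. Because the 3D mask already defines the instance, a bounding box is enough to attach a label, which is the intuition behind skipping a 2D segmentation model.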

Three-Dimensional Object Segmentation Method based on YOLO, SAM, and NeRF

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

ObjectFusion: an Object Detection and Segmentation Framework with RGB-D SLAM and Convolutional Neural Networks

3D Multiple-Contextual ROI-Attention Network for Efficient and Accurate Volumetric Medical Image Segmentation

ONeRF: Unsupervised 3D Object Segmentation from Multiple Views

NeRF-SOS: Any-View Self-supervised Object Segmentation on Complex Scenes

3DSAM: Segment Anything in NeRF

DiscoNeRF: Class-Agnostic Object Field for 3D Object Discovery

OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields

Obj-NeRF: Extract Object NeRFs from Multi-view Images

SegNeRF: 3D Part Segmentation with Neural Radiance Fields

SANeRF-HQ: Segment Anything for NeRF in High Quality

Unsupervised Multi-View Object Segmentation Using Radiance Field Propagation

Interactive Segment Anything NeRF with Feature Imitation

Instance Neural Radiance Field

DM-NeRF: 3D Scene Geometry Decomposition and Manipulation from 2D Images

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

2D-Guided 3D Gaussian Segmentation

Segment Anything in 3D with Radiance Fields

View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

NeSM: A NeRF-Based 3D Segmentation Method for Ultrasound Images