Abstract: Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that better performance in matching text prompts to 3D masks can be achieved faster with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup compared to the best existing method in the literature. On the ScanNet200 val. set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at http://github.com/aminebdj/OpenYOLO3D.

What problem does this paper attempt to address?

This paper addresses two main problems in open-vocabulary 3D instance segmentation: **slow inference speed** and **high computational resource requirements**. Although existing methods recognize objects of novel classes well, they rely on computationally expensive 2D foundation models such as the Segment Anything Model (SAM) and CLIP, so inference often takes several minutes per scene. This cost limits their use in practical applications that need predictions to be both fast and accurate.

To solve these problems, the paper proposes a new method named **Open-YOLO 3D**, which improves efficiency in three ways:

1. **Reduced dependence on 2D segmentation models**: Existing methods usually use 2D segmentation models to generate 3D mask features, whereas Open-YOLO 3D relies only on a 2D object detection model that produces bounding boxes, avoiding redundant computation.
2. **Accelerated 3D mask visibility calculation**: With accelerated visibility calculation (VAcc), Open-YOLO 3D computes the visibility of the 3D masks in all frames at once instead of iterating frame by frame.
3. **Multi-view prompt distribution (MVPDist)**: Multi-view information is used to predict the best text prompt for each 3D mask, giving efficient and accurate open-vocabulary 3D instance segmentation.

With these improvements, Open-YOLO 3D reaches state-of-the-art performance while running up to roughly 16 times faster than the best existing method. On the ScanNet200 validation set it achieves 24.7% mAP at 22 seconds per scene, which makes it better suited to applications that need rapid decisions.

### Formula presentation

The key formulas behind the method are:

1. **3D point cloud projection onto 2D image**:

   \[ P^{2D}_i = I_i \cdot E_i \cdot P \]

   where $P$ is the 3D point cloud, and $I_i$ and $E_i$ are the intrinsic and extrinsic parameter matrices of the $i$-th frame, respectively.

2. **Visibility calculation**:

   \[ V_f = \mathbb{1}(0 < P^{2D}_x < W) \odot \mathbb{1}(0 < P^{2D}_y < H) \]

   \[ V_d = \mathbb{1}(|P^{2D}_z - D_z| < \tau_{\text{depth}}) \]

   \[ V = (V_f \odot V_d) \cdot M^T \odot M^{-1}_{\text{count}} \]

   where $\mathbb{1}(\cdot)$ is the indicator function, $W$ and $H$ are the image width and height, $D_z$ is the depth read from the frame's depth map, and $\tau_{\text{depth}}$ is the depth threshold used to detect occlusion. $V_f$ marks points that fall inside the frame, $V_d$ marks points that are not occluded, and $V$ is the per-mask visibility obtained by summing over the 3D instance masks $M$ and normalizing by the number of points in each mask, $M_{\text{count}}$.

3. **Multi-view prompt distribution (MVPDist)**:

   \[ D_j = \left\{ L_i \left[ P^{2D}_{i,x} \cdot M_{ji}, P^{2D}_{i,y} \cdot M_{ji} \right] \mid \forall i \in P_k \right\} \]

   where $M_{ji}$ is the mask of non-occluded points belonging to the $j$-th instance in frame $i$, and $P_k$ is the set of frame indices in which the $j$-th 3D mask has the highest visibility.

These formulas show how Open-YOLO 3D achieves fast and accurate 3D instance segmentation through efficient computation.
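The projection, visibility, and MVPDist formulas above can be traced on a toy scene with the following minimal sketch. It is not the authors' implementation: the function names (`project`, `point_visibility`, `mask_visibility`, `vote_label`) are hypothetical, the perspective divide and the behind-camera check are standard pinhole-camera assumptions that the formulas leave implicit, and the label map $L_i$ is assumed here to be a per-pixel grid of class ids with `-1` outside every detection box. Plain Python loops are used for readability, so the sketch does not reproduce the all-frames-at-once computation that VAcc refers to.

```python
from collections import Counter


def project(points, K, E):
    """Project world points with intrinsics K (3x3) and extrinsics E (3x4).

    Returns one (u, v, depth) tuple per point.
    """
    out = []
    for x, y, z in points:
        # World -> camera coordinates using E = [R | t].
        xc, yc, zc = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in E)
        if zc <= 0:
            # Assumption: points behind the camera get an off-image pixel.
            out.append((-1.0, -1.0, zc))
            continue
        # Camera -> pixel coordinates, including the perspective divide.
        u = (K[0][0] * xc + K[0][1] * yc + K[0][2] * zc) / zc
        v = (K[1][0] * xc + K[1][1] * yc + K[1][2] * zc) / zc
        out.append((u, v, zc))
    return out


def point_visibility(proj, depth, tau):
    """Per-point V_f * V_d: inside the frame and within tau of the depth map."""
    h, w = len(depth), len(depth[0])
    flags = []
    for u, v, z in proj:
        in_frame = 0 < u < w and 0 < v < h
        flags.append(int(in_frame and abs(z - depth[int(v)][int(u)]) < tau))
    return flags


def mask_visibility(flags, masks):
    """Fraction of each binary mask's points that are visible in one frame."""
    return [
        sum(f for f, m in zip(flags, mask) if m) / max(1, sum(mask))
        for mask in masks
    ]


def vote_label(mask, projs, flags_per_frame, label_maps, top_k=1):
    """Most frequent label under a mask over its top_k most visible frames."""
    vis = [mask_visibility(flags, [mask])[0] for flags in flags_per_frame]
    top = sorted(range(len(vis)), key=lambda i: vis[i], reverse=True)[:top_k]
    votes = Counter()
    for i in top:
        for (u, v, _), f, m in zip(projs[i], flags_per_frame[i], mask):
            if f and m:
                label = label_maps[i][int(v)][int(u)]
                if label >= 0:  # assumed convention: -1 means "no box here"
                    votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Taking the most frequent label is one simple way to read off a prediction from the collected labels $D_j$; a mask with no visible, labelled points returns `None` rather than a guessed class.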

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

A robust multiclass 3D object recognition based on modern YOLO deep learning algorithms

INSTA-YOLO: Real-Time Instance Segmentation

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Three-Dimensional Object Segmentation Method based on YOLO, SAM, and NeRF

Open-Ended 3D Point Cloud Instance Segmentation

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

SAI3D: Segment Any Instance in 3D Scenes

OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data

ODIN: A Single Model for 2D and 3D Segmentation

OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation

Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

YOLO2U-Net: Detection-guided 3D instance segmentation for microscopy

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

OVO: Open-Vocabulary Occupancy