Abstract: We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task. Recent advancements in Open-Vocabulary scene understanding have made significant strides in this area by employing class-agnostic 3D instance proposal networks for object localization and learning queryable features for each 3D mask. While these methods produce high-quality instance proposals, they struggle with identifying small-scale and geometrically ambiguous objects. The key idea of our method is a new module that aggregates 2D instance masks across frames and maps them to geometrically coherent point cloud regions as high-quality object proposals, addressing the above limitations. These are then combined with 3D class-agnostic instance proposals to include a wide range of objects in the real world. To validate our approach, we conducted experiments on three prominent datasets, including ScanNet200, S3DIS, and Replica, demonstrating significant performance gains in segmenting objects with diverse categories over the state-of-the-art approaches.

What problem does this paper attempt to address?

The problem this paper addresses is Open-Vocabulary 3D Instance Segmentation (OV-3DIS). Specifically, the authors aim to obtain binary instance masks for arbitrary categories in 3D scenes, including categories that were not seen during training. The challenges of this problem are as follows:

1. **Limitations of traditional methods**: Existing fully-supervised 3D instance segmentation methods are limited by the closed-set framework and can only recognize predefined object categories.
2. **Difficulties in recognizing small-scale and geometrically ambiguous objects**: Existing methods have difficulty recognizing small-scale or geometrically unclear objects, especially those from rare categories.
3. **Challenges in 2D-to-3D mapping**: When mapping instance masks in 2D images to 3D point clouds, background points or other irrelevant regions may be included, resulting in poor-quality 3D proposals.

To solve these problems, the paper proposes Open3DIS, a new open-vocabulary 3D instance segmentation method. Its main contributions include:

1. **2D-Guided 3D Proposal Module**: By aggregating 2D instance masks across multi-view RGB-D images and mapping them to geometrically coherent point cloud regions, high-quality 3D object proposals are generated.
2. **Pointwise Feature Extraction**: A new pointwise feature extraction method for open-vocabulary 3D object proposals is introduced.
3. **Excellent performance**: Experiments on three datasets, ScanNet200, S3DIS, and Replica, show that Open3DIS significantly outperforms existing methods on the OV-3DIS task, especially when dealing with rare categories.
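To make the 2D-to-3D mapping challenge concrete, the following is a minimal sketch of lifting one 2D instance mask onto a point cloud using a pinhole camera model and a depth-consistency check. It is a generic projection routine, not the paper's 2D-Guided 3D Proposal Module; the function name `lift_mask_to_points`, the parameter names, and the tolerance `depth_tol` are all assumptions made for illustration.

```python
import numpy as np

def lift_mask_to_points(points_world, world_to_cam, K, depth, mask_2d, depth_tol=0.05):
    """Return a boolean (N,) array marking points that fall inside a 2D mask.

    Hypothetical helper, assuming:
      points_world : (N, 3) point cloud in world coordinates
      world_to_cam : (4, 4) extrinsic matrix of the view
      K            : (3, 3) pinhole intrinsics
      depth        : (H, W) depth map of the view, same units as the points
      mask_2d      : (H, W) boolean instance mask in that view
    """
    n = points_world.shape[0]
    homo = np.hstack([points_world, np.ones((n, 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]

    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negative depth

    # Pinhole projection to integer pixel coordinates (u = column, v = row).
    u = np.rint(K[0, 0] * cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.rint(K[1, 1] * cam[:, 1] / safe_z + K[1, 2]).astype(int)

    h, w = depth.shape
    in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    selected = np.zeros(n, dtype=bool)
    idx = np.flatnonzero(in_image)
    # A point counts as visible only if its depth agrees with the depth map;
    # this rejects occluded points that lie behind the masked surface.
    visible = np.abs(depth[v[idx], u[idx]] - z[idx]) < depth_tol
    selected[idx] = visible & mask_2d[v[idx], u[idx]]
    return selected
```

Even with the depth check, pixels near a mask boundary can still pull in neighbouring background points, which is the kind of noise that challenge 3 above describes and that the paper targets by mapping masks to geometrically coherent regions.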

### Formula presentation

The following are some key formulas involved in the paper:

- **Calculation of point cloud feature similarity**:
  \[ s'_{i,j} = \cos(f^{3D}_i, f^{3D}_j) \]
  where \(f^{3D}_i\) and \(f^{3D}_j\) are the feature vectors of two point cloud regions, each obtained by averaging the features of the region's points.
- **Calculation of the matching score matrix**:
  \[ c_{i,j} = \mathbb{1}(o'_{i,j} > \tau_{iou}) \odot \mathbb{1}(s'_{i,j} > \tau_{sim}) \]
  where \(o'_{i,j} = \text{IoU}(r_i, r_j)\) is the intersection-over-union of two regions, \(\tau_{iou}\) and \(\tau_{sim}\) are thresholds, \(\mathbb{1}(\cdot)\) is the indicator function, and \(\odot\) acts as a logical AND operator.
- **CLIP feature calculation**:
  \[ F^{\text{CLIP}} = NV\left(\sum_k \sum_\lambda (\nu_\lambda \ast f^{\text{CLIP}}_{\lambda,k}) \ast m^{3D}_k\right) \]
  where \(f^{\text{CLIP}}_{\lambda,k}\) is the 2D CLIP image feature of the \(k\)-th instance in the \(\lambda\)-th view, \(\nu_\lambda\) is the visibility map of view \(\lambda\), \(m^{3D}_k\) is the binary mask of the \(k\)-th proposal, and \(NV(x)\) is the L2-normalized vector of \(x\).
- **Final score between text query and 3D mask**:
  \[ s^{\text{CLIP}}_{k,\rho} = \frac{1}{|m^{3D}_k|} \sum_n \cos(F^{\text{CLIP}} \ast m^{3D}_k, e_\rho) \]
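As an illustration of how the similarity, matching, and final-score formulas above fit together, here is a minimal NumPy sketch. It assumes regions and proposals are given as boolean point masks, that \(e_\rho\) is the text embedding of query \(\rho\), and that the final score averages the per-point cosine similarity over the points inside the mask. All function names are hypothetical and the code is not taken from the paper's implementation; the CLIP feature aggregation step is not covered.

```python
import numpy as np

def region_features(point_feats, region_masks):
    """Average the per-point features inside each region (f^3D_i).

    point_feats: (N, D) float array, region_masks: (R, N) boolean array.
    """
    counts = np.maximum(region_masks.sum(axis=1, keepdims=True), 1)
    return region_masks.astype(point_feats.dtype) @ point_feats / counts

def cosine_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def iou_matrix(masks_a, masks_b):
    """Pairwise IoU between two sets of boolean point masks (o'_{i,j})."""
    a = masks_a.astype(np.int64)
    b = masks_b.astype(np.int64)
    inter = a @ b.T
    union = a.sum(axis=1)[:, None] + b.sum(axis=1)[None, :] - inter
    return inter / np.maximum(union, 1)

def matching_matrix(masks_a, masks_b, point_feats, tau_iou, tau_sim):
    """c_{i,j}: True where both the IoU and the feature similarity pass."""
    o = iou_matrix(masks_a, masks_b)
    s = cosine_matrix(region_features(point_feats, masks_a),
                      region_features(point_feats, masks_b))
    return (o > tau_iou) & (s > tau_sim)

def mask_text_score(point_clip_feats, mask, text_emb):
    """Mean cosine similarity between the masked point features and text_emb."""
    if not mask.any():
        return 0.0  # assumption: an empty proposal scores zero
    sims = cosine_matrix(point_clip_feats[mask], text_emb[None, :])[:, 0]
    return float(sims.mean())

# Toy example: six points with 2-D features, two regions per set.
feats = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
set_a = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 0]], dtype=bool)
set_b = np.array([[1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=bool)
matches = matching_matrix(set_a, set_b, feats, tau_iou=0.5, tau_sim=0.9)
score = mask_text_score(feats, set_a[0], np.array([1.0, 0.0]))
```

In the toy example each region in `set_a` matches only the region in `set_b` that overlaps it, since both conditions must hold at once. The threshold values used here are arbitrary and are not the paper's settings.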

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance
