Abstract: We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.

What problem does this paper attempt to address?

This paper addresses open-vocabulary instance segmentation in 3D scenes. Current 3D instance segmentation methods can usually recognize only the object classes in a predefined closed set annotated in the training dataset. This is a significant limitation in practical applications that involve novel or unseen objects. For example, an autonomous robot navigating an unknown environment may need to act on free-form queries (such as "find the side table with a vase"), which closed-vocabulary 3D instance segmentation methods struggle to handle.

The paper introduces OpenMask3D, a zero-shot approach that can segment instances of object classes not seen during training. OpenMask3D first predicts class-agnostic 3D instance masks, then aggregates a feature for each mask by fusing CLIP-based image embeddings across multiple views. Experiments show that OpenMask3D outperforms other open-vocabulary methods on the ScanNet200 and Replica datasets, especially on the long-tail distribution.

**Key Problem Summary**:

1. **Limitations of Closed Vocabulary**: Existing 3D instance segmentation methods can recognize only the predefined object classes annotated in the training dataset and cannot handle novel or unseen objects.
2. **Practical Application Requirements**: Real-world applications, such as robot navigation and augmented reality, need to handle a wide variety of objects specified by free-form queries.
3. **Zero-Shot Ability**: Open-vocabulary methods must generalize zero-shot to unseen object classes in order to adapt to diverse application scenarios.

With OpenMask3D, the paper aims to solve these problems and provide an effective method for open-vocabulary 3D instance segmentation.
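To make the per-mask aggregation and querying idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each instance mask already has one image embedding per view in which it is visible, and it uses plain mean pooling followed by cosine similarity against a text-query embedding. The function names (`aggregate_mask_feature`, `rank_masks`), the mean-pooling choice, and the toy 3-dimensional vectors are all assumptions for illustration; real CLIP embeddings are much higher-dimensional and would come from a CLIP image and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def aggregate_mask_feature(view_embeddings):
    """Fuse the per-view embeddings of one mask into a single feature.

    Assumption: simple mean pooling of unit-normalized embeddings.
    """
    normalized = [l2_normalize(v) for v in view_embeddings]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def rank_masks(mask_view_embeddings, query_embedding):
    """Return (mask_id, score) pairs sorted from most to least similar."""
    scores = {
        mask_id: cosine(aggregate_mask_feature(views), query_embedding)
        for mask_id, views in mask_view_embeddings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy placeholder embeddings (hypothetical values, not real CLIP outputs):
# two class-agnostic masks, each observed in two views.
mask_view_embeddings = {
    "mask_0": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "mask_1": [[0.0, 0.1, 0.9], [0.1, 0.0, 1.0]],
}
query_embedding = [1.0, 0.0, 0.0]  # stands in for an encoded text query

ranking = rank_masks(mask_view_embeddings, query_embedding)
best_mask_id = ranking[0][0]
```

Because the masks are class-agnostic, no label set is fixed in advance: any text query that can be embedded in the same space can be scored against every mask, which is what allows zero-shot, open-vocabulary retrieval of individual instances.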

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Search3D: Hierarchical Open-Vocabulary 3D Segmentation

Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation

Open-Ended 3D Point Cloud Instance Segmentation

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation

OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

OccuSeg: Occupancy-Aware 3D Instance Segmentation

OpenScene: 3D Scene Understanding with Open Vocabularies

MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation

RefMask3D: Language-Guided Transformer for 3D Referring Segmentation

SAI3D: Segment Any Instance in 3D Scenes

CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation