Abstract: Most recent 3D instance segmentation methods are open vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, i.e., answering "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for both mask coherence and semantic coherence that are estimated from the 2D object instance masks. We evaluate our method using ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings. Code will be made available.

What problem does this paper attempt to address?

The problem this paper addresses is removing the dependence on a predefined vocabulary prior in 3D instance segmentation, so that segmentation works in a vocabulary-free setting with no preset category information. Specifically:

- **Limitations of existing methods**: Most existing 3D instance segmentation methods are open-vocabulary: at test time they can identify objects beyond the categories seen in training, but they are still limited to the specific vocabulary the user provides at test time. These methods cannot answer a question such as "What objects are there in the scene?" in an open-ended way.
- **Proposed new problem setting**: The paper proposes a new task setting, Vocabulary-Free 3D Instance Segmentation (VoF3DIS). In this setting, the model must segment all object instances in the point cloud and assign semantic labels without any predefined vocabulary. The model therefore has to identify and name the objects in the scene on its own during inference, without relying on a user-specified vocabulary or query.
- **Challenges and significance**: The vocabulary-free setting matters most in dynamically changing scenes, such as assistive robotics applications, where the objects in the scene may be unknown or undefined in advance. Methods that can handle such scenes are therefore of practical importance.

### Specific objectives

1. **Introduce a new task**: Formally define the vocabulary-free 3D instance segmentation task (VoF3DIS), that is, segmenting object instances in 3D point clouds and assigning semantic labels without a predefined vocabulary.
2. **Propose a new method**: Design a new method named PoVo. It uses a vision-language assistant and a 2D instance segmentation model to identify and locate objects in the scene, then lifts this information to 3D space to generate 3D instance masks.
3. **Innovations**:
   - PoVo is zero-shot and does not need to be trained on 2D or 3D data.
   - It proposes a new superpoint merging strategy: spectral clustering that considers both mask consistency and semantic consistency to form the final 3D instance masks.
   - It achieves better results than existing methods on ScanNet200 and Replica, in both the vocabulary-free and the open-vocabulary settings.

### Solution overview

To achieve these objectives, PoVo proceeds in three steps:

1. **Scene vocabulary generation**: Extract an object list from multi-view images with a vision-language assistant and verify it with an open-vocabulary 2D instance segmentation model to form a scene vocabulary.
2. **Superpoint generation and merging**: Divide the 3D scene into geometrically consistent superpoints and merge them into 3D instance masks through spectral clustering, taking into account both the consistency of the 2D instance masks and semantic consistency.
3. **Text-aligned point representation**: Generate a text-aligned feature representation for the points in each 3D instance mask, which is used for the final semantic classification.

With this pipeline, PoVo identifies and segments object instances in 3D scenes without a predefined vocabulary.
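The superpoint merging step can be illustrated with a small NumPy sketch. It is not the authors' implementation: the summary does not say how the two coherence terms are computed or combined, so the cosine-similarity definitions, the weighted sum with `alpha`, the fixed `n_clusters`, and all function names here are assumptions made for illustration only.

```python
import numpy as np

def combined_affinity(mask_votes, sem_feats, alpha=0.5):
    """Build a superpoint affinity matrix (hypothetical formulation).

    mask_votes: (S, M) array, entry [s, m] > 0 if superpoint s falls in 2D mask m.
    sem_feats:  (S, D) array of per-superpoint semantic features.
    """
    eps = 1e-8
    # Assumed "mask coherence": cosine similarity of mask-membership rows.
    m = mask_votes / (np.linalg.norm(mask_votes, axis=1, keepdims=True) + eps)
    a_mask = m @ m.T
    # Assumed "semantic coherence": cosine similarity of features, clipped to [0, 1].
    f = sem_feats / (np.linalg.norm(sem_feats, axis=1, keepdims=True) + eps)
    a_sem = np.clip(f @ f.T, 0.0, 1.0)
    # Assumed combination rule: a simple weighted sum.
    w = alpha * a_mask + (1.0 - alpha) * a_sem
    np.fill_diagonal(w, 0.0)
    return w

def spectral_merge(w, n_clusters, n_iter=20):
    """Cluster superpoints from affinity w with normalized spectral clustering."""
    eps = 1e-8
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1) + eps)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    lap = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the smallest n_clusters.
    _, vecs = np.linalg.eigh(lap)
    emb = vecs[:, :n_clusters]
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + eps)
    # Small k-means with deterministic farthest-point initialization.
    centers = [emb[0]]
    for _ in range(1, n_clusters):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        sq_dist = ((emb[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(sq_dist, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels

# Toy scene: six superpoints, four 2D masks, two underlying objects.
mask_votes = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
], dtype=float)
sem_feats = np.array([
    [1.0, 0.1],
    [0.9, 0.0],
    [1.0, 0.2],
    [0.1, 1.0],
    [0.0, 1.0],
    [0.2, 0.9],
])

labels = spectral_merge(combined_affinity(mask_votes, sem_feats), n_clusters=2)
print(labels)  # one instance id per superpoint
```

In this toy scene the first three superpoints share 2D masks and similar features, so they end up in one cluster and the last three in another. A real pipeline would also need a way to choose the number of clusters, which this sketch simply takes as given.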

Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant
