Abstract: Open-vocabulary 3D instance segmentation transcends traditional closed-vocabulary methods by enabling the identification of both previously seen and unseen objects in real-world scenarios. It leverages a dual-modality approach, utilizing both 3D point clouds and 2D multi-view images to generate class-agnostic object mask proposals. Previous efforts predominantly focused on enhancing 3D mask proposal models; consequently, the information available from associating 2D images with 3D data was not fully exploited. This bias towards 3D data, while effective for familiar indoor objects, limits the system's adaptability to new and varied object types, where 2D models offer greater utility. Addressing this gap, we introduce a Zero-Shot Dual-Path Integration Framework that equally values the contributions of both 3D and 2D modalities. Our framework comprises three components: a 3D pathway, a 2D pathway, and Dual-Path Integration. The 3D pathway generates spatially accurate class-agnostic mask proposals of common indoor objects from 3D point cloud data using a pre-trained 3D model, while the 2D pathway utilizes a pre-trained open-vocabulary instance segmentation model to identify a diverse array of object proposals from multi-view RGB-D images. In Dual-Path Integration, our Conditional Integration process, which operates in two stages, adaptively filters and merges the proposals from both pathways. This process harmonizes the output proposals to enhance segmentation capabilities. Our framework, utilizing pre-trained models in a zero-shot manner, is model-agnostic and demonstrates superior performance on both seen and unseen data, as evidenced by comprehensive evaluations on ScanNet200 and qualitative results on ARKitScenes.

What problem does this paper attempt to address?

This paper addresses 3D instance segmentation in an open-vocabulary setting. Traditional 3D instance segmentation methods are usually limited to the closed-vocabulary paradigm, in which the object categories to be segmented are fixed at training time. Real-world applications often involve objects outside these predefined categories, which makes closed-vocabulary methods inflexible. To solve this problem, the paper proposes a Zero-Shot Dual-Path Integration Framework that recognizes and segments both known and unknown objects by combining the strengths of 3D point cloud data and 2D multi-view images.

### Main contributions of the paper:

1. **Zero-shot 3D instance segmentation**: The framework uses pre-trained models from both the 3D and 2D modalities in a zero-shot manner. It is model-agnostic, so it is not tied to one specific pre-trained model in either modality.
2. **Dual-path integration**: A dual-path integration framework is introduced, including a conditional integration process that combines the instance mask proposals generated by the 3D and 2D paths, improving the quality and diversity of mask proposals.
3. **Enhanced overall performance**: Quantitative evaluation on ScanNet200 and qualitative results on ARKitScenes support the framework's performance on the open-vocabulary 3D instance segmentation task.

### Method overview:

- **3D path**: Use a pre-trained 3D instance segmentation network to generate spatially accurate class-agnostic mask proposals from 3D point cloud data and extract the visual features of each mask.
- **2D path**: Use a pre-trained open-vocabulary 2D instance segmentation network to generate 2D mask proposals from multi-view RGB-D images, project them into the 3D point cloud, and refine them through an instance fusion module.
- **Dual-path integration**: Apply a conditional integration process, consisting of bimodal proposal matching and adaptive integration, to filter and merge the proposals from the 3D and 2D paths so that the final output is both high in quality and diverse.

### Experimental results:

- **Quantitative results**: On the ScanNet200 dataset, the framework significantly outperforms existing methods in the average precision (AP) metric, especially in the "head" and "common" categories.
- **Qualitative results**: Visualizations, especially on the ARKitScenes dataset, show the effectiveness and adaptability of the framework in handling known and unknown objects.

### Summary:

This paper tackles key challenges in open-vocabulary 3D instance segmentation by proposing a Zero-Shot Dual-Path Integration Framework that recognizes and segments both known and unknown objects. The method performs well in quantitative evaluation on ScanNet200 and shows adaptability to unseen objects in qualitative results on ARKitScenes.
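The dual-path integration step in the method overview (match proposals across modalities, then filter and merge) can be illustrated with a minimal sketch. This is not the paper's actual rule set: the summary does not give the matching criterion or the merge conditions, so the IoU-based matching, the `match_iou` threshold of 0.5, and the policy "keep every 3D proposal, add only unmatched 2D proposals" are all assumptions made for illustration. The names `iou` and `integrate` are hypothetical.

```python
from typing import FrozenSet, List

# A proposal is represented as the set of point indices it covers
# in the shared 3D point cloud (an assumed representation).
Mask = FrozenSet[int]


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two point-index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def integrate(masks_3d: List[Mask], masks_2d: List[Mask],
              match_iou: float = 0.5) -> List[Mask]:
    """Assumed merge policy: keep all 3D proposals and add each
    2D-derived proposal whose best IoU with any 3D proposal is
    below match_iou (i.e. it likely covers a new object)."""
    merged = list(masks_3d)
    for m2d in masks_2d:
        best = max((iou(m2d, m3d) for m3d in masks_3d), default=0.0)
        if best < match_iou:
            merged.append(m2d)
    return merged


# Toy scene: one 3D proposal, two 2D-derived proposals.
masks_3d = [frozenset(range(0, 10))]
masks_2d = [
    frozenset(range(1, 10)),    # overlaps the 3D proposal heavily
    frozenset(range(20, 25)),   # covers points no 3D proposal found
]
result = integrate(masks_3d, masks_2d)
```

In this toy scene the first 2D proposal duplicates the existing 3D proposal and is filtered out, while the second is kept as an additional object. A real system would also need a rule for the matched pairs (for example, which of the two masks to trust), which this sketch deliberately leaves out.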

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation
