Abstract: Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we develop a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.

What problem does this paper attempt to address?

The paper addresses how to segment novel categories with accurate, fine-grained geometric boundaries in open-vocabulary 3D semantic segmentation. Existing methods mainly focus on establishing a unified feature space covering the 3D, 2D, and text modalities. However, traditional techniques such as global feature alignment or vision-language model distillation often achieve only approximate correspondence and perform poorly at delineating fine-grained segmentation boundaries. To close this gap, the authors propose a finer, mask-level alignment between 3D features and the 2D-text embedding space through the cross-modal mask reasoning framework XMask3D.

XMask3D consists of three main parts:

1. **3D Geometry Extraction Branch**: Extracts geometric features from 3D point clouds.
2. **2D Mask Generation Branch**: Generates 2D masks with open-vocabulary capability using the denoising UNet of a pre-trained diffusion model.
3. **3D-2D Feature Fusion Module**: Combines the features of the 3D and 2D branches to improve performance on open-vocabulary 3D semantic segmentation.

With these designs, XMask3D achieves competitive performance on multiple benchmarks for open-vocabulary 3D semantic segmentation.

The key technical components are:

- **3D-to-2D Mask Generation**: Global point cloud features are used as conditional inputs to the mask generator, producing geometry-aware masks that transfer better to the 3D modality.
- **2D-to-3D Mask Regularization**: A mask-level regularization is applied to the 3D features to align them with the vision-language embedding space, strengthening the open-vocabulary capability of 3D features on novel categories while retaining fine-grained geometric information.
- **3D-2D Mask Feature Fusion**: Mask features from the two modalities are merged in a fusion block to strengthen the synergy between 2D and 3D features.

Together, these techniques improve open-vocabulary 3D semantic segmentation, especially the fine-grained geometric segmentation of novel categories.
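
The mask-level alignment idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`mask_pool`, `mask_alignment_loss`, `classify_masks`), the use of average pooling, and the cosine-distance loss are all assumptions made for illustration, and the 2D masks are assumed to have already been back-projected onto the points as binary per-point masks. The sketch pools per-point 3D features inside each mask, penalizes their cosine distance to the corresponding 2D mask embeddings, and labels each mask with the closest text embedding.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length (rows with near-zero norm stay near zero)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def mask_pool(point_feats, masks):
    """Average the per-point features that fall inside each mask.

    point_feats: (N, D) array of 3D point features.
    masks:       (M, N) binary array; masks[m, n] = 1 if point n is in mask m.
    Returns an (M, D) array of mask-level features.
    """
    masks = masks.astype(point_feats.dtype)
    counts = masks.sum(axis=1, keepdims=True)
    # Guard against empty masks to avoid division by zero.
    return (masks @ point_feats) / np.maximum(counts, 1.0)


def mask_alignment_loss(point_feats, masks, mask_embeds_2d):
    """Mean cosine distance between pooled 3D mask features and 2D mask embeddings.

    Assumption: cosine distance is used here only as a simple stand-in
    for a mask-level regularization term.
    """
    pooled = l2_normalize(mask_pool(point_feats, masks))
    target = l2_normalize(mask_embeds_2d)
    cosine = (pooled * target).sum(axis=-1)
    return float(np.mean(1.0 - cosine))


def classify_masks(mask_feats, text_embeds):
    """Assign each mask the index of its most similar text embedding."""
    similarity = l2_normalize(mask_feats) @ l2_normalize(text_embeds).T
    return similarity.argmax(axis=1)
```

Pooling before comparing is what makes the alignment mask-level rather than global or per-point: each region contributes one feature vector, so the loss constrains 3D features region by region while leaving per-point variation inside a mask free to encode geometry.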

XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Delving Deeper into Mask Utilization in Video Object Segmentation

When Masked Image Modeling Meets Source-free Unsupervised Domain Adaptation: Dual-Level Masked Network for Semantic Segmentation

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation

Mx2M: Masked Cross-Modality Modeling in Domain Adaptation for 3D Semantic Segmentation

TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

RefMask3D: Language-Guided Transformer for 3D Referring Segmentation

Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

SAM-Guided Masked Token Prediction for 3D Scene Understanding

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation

Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

MaskMentor: Unlocking the Potential of Masked Self-Teaching for Missing Modality RGB-D Semantic Segmentation

MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding