XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Ziyi Wang, Yanbo Wang, Xumin Yu, Jie Zhou, Jiwen Lu
2024-11-20
Abstract: Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we developed a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to segment novel categories with accurate, fine-grained boundaries in open-vocabulary 3D semantic segmentation. Existing methods mainly focus on building a unified feature space covering the 3D, 2D, and text modalities. However, traditional techniques such as global feature alignment or vision-language model distillation achieve only approximate correspondence and struggle in particular to delineate fine-grained segmentation boundaries. To close this gap, the authors propose a finer, mask-level alignment between 3D features and the 2D-text embedding space, implemented in the cross-modal mask reasoning framework XMask3D.

XMask3D consists of three main parts:

1. **3D Geometry Extraction Branch**: Extracts geometric features from 3D point clouds.
2. **2D Mask Generation Branch**: Generates 2D masks with open-vocabulary capability, based on the denoising UNet of a pre-trained diffusion model.
3. **3D-2D Feature Fusion Module**: Combines the features of the 3D and 2D branches.

With these designs, XMask3D achieves competitive performance on multiple benchmarks for open-vocabulary 3D semantic segmentation.

Specific technical details include:

- **3D-to-2D Mask Generation**: Global point cloud features are fed to the denoising UNet as conditional inputs, so the generated masks are geometry-aware and better suited for transfer to the 3D modality.
- **2D-to-3D Mask Regularization**: A mask-level regularization on the 3D features aligns them with the vision-language embedding space, strengthening the open-vocabulary capability of 3D features on novel categories while retaining fine-grained geometric information.
- **3D-2D Mask Feature Fusion**: Mask features from the two modalities are merged in a fusion block so that the 2D and 3D features complement each other.

Together, these techniques improve open-vocabulary 3D semantic segmentation, especially fine-grained geometric segmentation of novel categories.
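The mask-level regularization and open-vocabulary labeling described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`mask_pool`, `cosine_alignment_loss`, `classify_masks`), the use of mask-average pooling, the `1 - cosine` loss, and the plain averaging used as "fusion" are all assumptions chosen for illustration, and the random arrays stand in for real 3D features, 2D mask embeddings, and text embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mask_pool(point_feats, masks, eps=1e-8):
    """Average per-point features inside each binary mask.

    point_feats: (N, D) per-point 3D features.
    masks:       (M, N) binary masks, one row per mask.
    Returns:     (M, D) one embedding per mask.
    """
    weights = masks / (masks.sum(axis=1, keepdims=True) + eps)
    return weights @ point_feats


def cosine_alignment_loss(feats_3d, feats_2d):
    """Mean (1 - cosine similarity) between paired mask embeddings.

    Assumed stand-in for a mask-level regularization term that pulls
    3D mask embeddings toward their 2D vision-language counterparts.
    """
    cos = np.sum(l2_normalize(feats_3d) * l2_normalize(feats_2d), axis=-1)
    return float(np.mean(1.0 - cos))


def classify_masks(mask_feats, text_embeds):
    """Label each mask with the index of its most similar text embedding."""
    sims = l2_normalize(mask_feats) @ l2_normalize(text_embeds).T
    return sims.argmax(axis=1)


# Toy data: 6 points, 4-dim features, 2 masks, 3 candidate class names.
rng = np.random.default_rng(0)
point_feats = rng.normal(size=(6, 4))
masks = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)

mask_feats_3d = mask_pool(point_feats, masks)       # (2, 4)
mask_feats_2d = rng.normal(size=(2, 4))             # placeholder 2D mask embeddings
loss = cosine_alignment_loss(mask_feats_3d, mask_feats_2d)

# Placeholder fusion: a plain average, not the paper's fusion block.
fused = 0.5 * (mask_feats_3d + mask_feats_2d)
text_embeds = rng.normal(size=(3, 4))               # placeholder class-name embeddings
labels = classify_masks(fused, text_embeds)         # (2,) class index per mask
```

Working at the mask level means each region contributes a single pooled embedding to the alignment term, rather than one global vector per scene, which is the distinction the summary draws against global feature alignment.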