Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation

Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Min Wu, Ming-Ming Cheng, Ender Konukoglu, Serge Belongie
2024-10-30
Abstract: Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly-ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at https://github.com/ZhaochongAn/Multimodality-3D-Few-Shot.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper addresses a key question in few-shot 3D point cloud semantic segmentation (FS-PCS): how to leverage multimodal information to improve a model's performance on novel categories. Existing FS-PCS methods focus mainly on unimodal point cloud input and overlook the potential benefits of other modalities. The paper introduces a cost-free multimodal FS-PCS setup that uses textual labels and the potentially available 2D image modality, and proposes a new model, the MultiModal Few-Shot SegNet (MM-FSS), that fuses information from these modalities to achieve better 3D point cloud semantic segmentation in few-shot scenarios.

### Specific Problems

1. **Limitations of Existing Methods**:
   - Existing FS-PCS methods rely mainly on unimodal point cloud input.
   - They neglect other useful modalities, such as category names and 2D images.
2. **Potential of Multimodal Information**:
   - From a neuroscience perspective, human cognitive learning is multimodal, and different modalities show strong correspondence within the same concept.
   - In particular, combining signals such as vision and language surpasses vision alone on certain tasks.
3. **Research Questions**:
   - How can additional modalities be used effectively in few-shot 3D point cloud semantic segmentation?
   - How can a model be designed to fully leverage multimodal information and improve segmentation performance in few-shot scenarios?

### Solutions

1. **Multimodal FS-PCS Setup**:
   - Introduce textual labels and the 2D image modality as additional sources of information.
   - Use a pre-training approach in which 3D features learn to simulate 2D features, so the 2D information can be exploited without needing 2D images.
2. **MM-FSS Model**:
   - Use a shared 3D backbone with two heads to extract intermodal and unimodal visual features, respectively, and a pretrained text encoder to generate text embeddings.
   - Design a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations and a Multimodal Semantic Fusion (MSF) module to refine them using text-aware semantic guidance.
   - Propose a Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias and further improve generalization.
3. **Experimental Validation**:
   - Conduct experiments on the S3DIS and ScanNet datasets, showing that MM-FSS significantly outperforms existing methods under various few-shot settings.
   - Run extensive ablation studies that further validate the effectiveness of each module and the value of multimodal information.

### Summary

By introducing multimodal information, specifically textual labels and the 2D image modality, this paper addresses the unimodal limitation of existing FS-PCS methods and proposes a new model, MM-FSS, which significantly improves segmentation performance in few-shot scenarios. The results indicate that commonly ignored free modalities are worth exploiting and offer useful guidance for future FS-PCS research.
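
The idea of correlating query points with the support set in two feature spaces and then fusing the results can be illustrated with a toy sketch. This is not the authors' implementation. The hand-written feature vectors stand in for the outputs of the unimodal and intermodal heads, the class prototypes are plain masked averages, and the fixed weighted average in `fuse` is a simple substitute for the learned MCF module. All function names, the `weight` parameter, and the data are hypothetical, and text guidance (MSF) and TACC are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def masked_mean(features, mask):
    """Prototype: mean feature vector over the support points selected by mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no points")
    return [sum(column) / len(selected) for column in zip(*selected)]


def correlations(query_feats, support_feats, fg_mask):
    """For each query point: [similarity to background, similarity to foreground]."""
    bg_mask = [1 - m for m in fg_mask]
    prototypes = [masked_mean(support_feats, bg_mask),
                  masked_mean(support_feats, fg_mask)]
    return [[cosine(q, p) for p in prototypes] for q in query_feats]


def fuse(corr_a, corr_b, weight=0.5):
    """Assumption: a fixed weighted average stands in for a learned fusion module."""
    return [[weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(corr_a, corr_b)]


def segment_query(uni_query, uni_support, inter_query, inter_support, fg_mask):
    """Label each query point 1 (target class) or 0 (background) from fused correlations."""
    fused = fuse(correlations(uni_query, uni_support, fg_mask),
                 correlations(inter_query, inter_support, fg_mask))
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Toy 1-way 1-shot episode: 4 support points (the first two belong to the
# target class) and 3 query points, each described in two feature spaces.
fg_mask = [1, 1, 0, 0]
uni_support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
inter_support = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 1.0]]
# The third query point is ambiguous in the unimodal space ([0.5, 0.5]) but
# clearly resembles the target class in the intermodal space.
uni_query = [[0.8, 0.2], [0.1, 1.0], [0.5, 0.5]]
inter_query = [[0.9, 0.0, 0.1], [0.0, 0.0, 1.0], [0.9, 0.0, 0.2]]

predictions = segment_query(uni_query, uni_support, inter_query, inter_support, fg_mask)
print(predictions)  # [1, 0, 1]
```

In this toy episode the third query point sits halfway between the two prototypes in the unimodal space, and the second feature space is what tips it toward the target class. That is the intuition behind combining modalities, though the paper's learned modules are considerably richer than a weighted average.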