Abstract: Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at https://github.com/ZhaochongAn/Multimodality-3D-Few-Shot.
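To make the two-head idea in the abstract more concrete, here is a minimal pure-Python sketch of how per-point correlations between a query and a support prototype might be computed for each head and then combined. Everything in it is an assumption for illustration: the function names, the toy features, and in particular `fuse`, which uses a fixed weighted average as a stand-in for the learned Multimodal Correlation Fusion module. It is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def masked_prototype(features, mask):
    """Average the support features of points whose mask value is 1."""
    selected = [f for f, m in zip(features, mask) if m == 1]
    if not selected:
        raise ValueError("support mask selects no points")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def correlations(query_features, prototype):
    """One correlation score per query point against a single prototype."""
    return [cosine(q, prototype) for q in query_features]

def fuse(corr_a, corr_b, weight=0.5):
    """Hypothetical stand-in for a learned fusion: a fixed convex combination."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(corr_a, corr_b)]

# Toy data (assumed): 2 support points, 2 query points, 2-D features per head.
support_unimodal = [[1.0, 0.0], [0.0, 1.0]]
support_intermodal = [[1.0, 0.0], [0.0, 1.0]]
support_mask = [1, 0]  # only the first support point is foreground

query_unimodal = [[2.0, 0.0], [0.0, 3.0]]
query_intermodal = [[1.0, 0.0], [0.0, 1.0]]

proto_uni = masked_prototype(support_unimodal, support_mask)
proto_inter = masked_prototype(support_intermodal, support_mask)

fused = fuse(correlations(query_unimodal, proto_uni),
             correlations(query_intermodal, proto_inter))
print(fused)
```

In this toy case the first query point aligns with the foreground prototype under both heads and the second is orthogonal to it, so the fused score separates the two points. A real model would learn the fusion rather than fix the weight.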

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper addresses a key issue in few-shot 3D point cloud semantic segmentation (FS-PCS): how to leverage multimodal information to improve a model's performance on novel categories. Existing FS-PCS methods mainly focus on unimodal point cloud input and overlook the potential advantages of multimodal information. The paper introduces a cost-free multimodal FS-PCS setup that uses textual labels and the 2D image modality, and proposes a new model, the MultiModal Few-Shot SegNet (MM-FSS), to fuse information from the different modalities and achieve better 3D point cloud semantic segmentation in few-shot scenarios.

### Specific Problems

1. **Limitations of Existing Methods**:
   - Existing FS-PCS methods rely mainly on unimodal point cloud input.
   - They neglect other useful modalities, such as category names and 2D images.
2. **Potential of Multimodal Information**:
   - From a neuroscience perspective, human cognitive learning is multimodal, with different modalities showing strong correspondence within the same concept.
   - In particular, combining multimodal signals such as vision and language can outperform visual information alone on certain tasks.
3. **Key Questions**:
   - How can additional modalities be used effectively in few-shot 3D point cloud semantic segmentation?
   - How can a model be designed to fully leverage multimodal information and improve segmentation performance in few-shot scenarios?

### Solutions

1. **Multimodal FS-PCS Setup**:
   - Introduce text labels and the 2D image modality as additional sources of information.
   - Use pretraining so that 3D features simulate 2D features, allowing this information to be used without requiring 2D images.
2. **MM-FSS Model**:
   - Use a shared 3D backbone with two heads to extract intermodal and unimodal features, respectively.
   - Design a Multimodal Correlation Fusion (MCF) module and a Multimodal Semantic Fusion (MSF) module to fuse information from the different modalities.
   - Propose a Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias and further improve generalization.
3. **Experimental Validation**:
   - Experiments on the S3DIS and ScanNet datasets show that MM-FSS significantly outperforms existing methods under various few-shot settings.
   - Extensive ablation studies further validate the effectiveness of each module and the value of multimodal information.

### Summary

By introducing multimodal information, particularly text labels and the 2D image modality, this paper addresses the limitations of existing few-shot 3D point cloud semantic segmentation methods and proposes a new model, MM-FSS, which significantly improves segmentation performance in few-shot scenarios. The work provides useful insights for future research on few-shot 3D point cloud semantic segmentation.
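The Test-time Adaptive Cross-modal Calibration (TACC) step listed under Solutions can be pictured with a small sketch. The text above only says that TACC mitigates training bias at test time, so the specifics here are assumptions made for illustration: a reliability indicator `gamma` is estimated on the labeled support sample as the IoU between thresholded text-derived guidance and the support mask, and the query scores are then shifted by `gamma` times the guidance. The names `reliability` and `calibrate`, the threshold, and the toy numbers are all hypothetical and may differ from the paper's exact formulation.

```python
def iou(pred_mask, true_mask):
    """Intersection-over-union of two binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 or t == 1)
    return inter / union if union > 0 else 0.0

def reliability(support_guidance, support_mask, threshold=0.5):
    """Hypothetical indicator: how well thresholded guidance matches the support labels."""
    guessed = [1 if g > threshold else 0 for g in support_guidance]
    return iou(guessed, support_mask)

def calibrate(query_scores, query_guidance, gamma):
    """Add guidance to the model's foreground scores, scaled by gamma."""
    return [s + gamma * g for s, g in zip(query_scores, query_guidance)]

# Toy support sample (assumed): guidance score per point and the annotated mask.
support_guidance = [0.9, 0.8, 0.2, 0.7]
support_mask = [1, 1, 0, 0]

gamma = reliability(support_guidance, support_mask)  # 2 / 3 for this toy data

# Toy query sample (assumed): model scores and text-derived guidance per point.
query_scores = [0.4, 0.6]
query_guidance = [0.9, 0.0]
calibrated = calibrate(query_scores, query_guidance, gamma)
print(gamma, calibrated)
```

The design intuition is that guidance which already agrees with the support annotation is trusted more on the query, while unreliable guidance (low `gamma`) leaves the model's own prediction nearly unchanged.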

Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation
