Abstract: Robust and accurate segmentation of scenes has become one core functionality in various visual recognition and navigation tasks. This has inspired the recent development of Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely-adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely, unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, enabling label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.

What problem does this paper attempt to address?

This paper addresses a limitation of the Segment Anything Model (SAM): it is designed mainly for single-modal RGB images, which limits its use in multi-sensor fusion scenarios. Specifically, the paper points out that SAM performs poorly on non-RGB modalities such as LiDAR, depth, and thermal imaging, which restricts its applicability and performance on multimodal data and sensor suites. To overcome these limitations, the paper proposes Multi-Modal SAM (MM-SAM), an extended and improved SAM that supports cross-modal and multimodal processing for more robust and enhanced segmentation.

MM-SAM relies on two key technical designs:

1. **Unsupervised Cross-Modal Transfer (UCMT)**: UCMT introduces a modality-specific patch embedding module and a parameter-efficient tuning method to extract modality-specific features from different sensors. An Embedding Unification Loss ensures that the different modalities share a unified representation in the output latent space of the SAM image encoder. This significantly improves the segmentation ability of MM-SAM on each modality.
2. **Weakly-supervised Multi-Modal Fusion (WMMF)**: WMMF fuses multimodal embeddings through a Selective Fusion Gate (SFG). The SFG generates per-patch weights from all input sensor modalities and uses them for a weighted fusion of the modality embeddings. WMMF also introduces a Multi-Modal Pseudo Labeling method, which generates pseudo-labels through geometric cues for training the SFG, so training requires no ground-truth labels.

Through these designs, MM-SAM addresses three main challenges:

1. **Adapting to diverse non-RGB sensors**: Enabling SAM to process single-modal non-RGB data.
2. **Co-processing multimodal data**: Improving the robustness and accuracy of segmentation through sensor fusion.
3. **Training without mask annotation**: Allowing MM-SAM to be adapted to different downstream tasks in a label-efficient and parameter-efficient way, without additional mask annotation.

Experimental results show that MM-SAM consistently outperforms SAM on multiple multimodal datasets, demonstrating its effectiveness and robustness across various sensors and data modalities.
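As a rough illustration of the two designs summarized above, the NumPy sketch below shows one way an embedding unification loss and a per-patch selective fusion gate could look. It is not the paper's implementation. The function names, the mean-squared-error form of the loss, the single linear layer used as the gate, and the softmax normalization over modalities are all assumptions made for illustration; the summary only states that embeddings are unified in the encoder's output space and that the gate produces per-patch weights from all modalities.

```python
import numpy as np


def embedding_unification_loss(rgb_emb: np.ndarray, other_emb: np.ndarray) -> float:
    """Assumed loss form: mean squared error between paired patch embeddings.

    rgb_emb, other_emb: arrays of shape (num_patches, dim) for the same scene,
    one from the RGB branch and one from a non-RGB branch (e.g. depth).
    """
    return float(np.mean((rgb_emb - other_emb) ** 2))


def selective_fusion_gate(embeddings: np.ndarray, gate_weight: np.ndarray):
    """Fuse modality embeddings with per-patch weights (hypothetical gate).

    embeddings:  (num_modalities, num_patches, dim)
    gate_weight: (num_modalities * dim, num_modalities), a stand-in linear gate
    Returns the fused embeddings (num_patches, dim) and the per-patch weights
    (num_patches, num_modalities).
    """
    num_modalities, num_patches, dim = embeddings.shape
    # Concatenate all modalities per patch so the gate sees every sensor.
    concat = embeddings.transpose(1, 0, 2).reshape(num_patches, num_modalities * dim)
    logits = concat @ gate_weight
    # Numerically stable softmax over the modality axis (an assumption).
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum of the modality embeddings for each patch.
    fused = np.einsum("pm,mpd->pd", weights, embeddings)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(4, 8))                # 4 patches, 8-dim embeddings
    depth = rgb + 0.1 * rng.normal(size=(4, 8))  # a nearby non-RGB embedding
    print("unification loss:", embedding_unification_loss(rgb, depth))

    gate_w = rng.normal(size=(2 * 8, 2))         # random, untrained gate
    fused, weights = selective_fusion_gate(np.stack([rgb, depth]), gate_w)
    print("fused shape:", fused.shape)
    print("weights per patch sum to:", weights.sum(axis=1))
```

In this sketch the gate weights are random. In the approach described above, the gate would instead be trained with pseudo-labels, so that each patch leans on whichever modality is more reliable there.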

Segment Anything with Multiple Modalities
