Abstract: Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness by training models from scratch, the limited amount of multi-modal data keeps these methods from reaching their full potential. In this paper, we propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Although SAM is a recent vision foundation model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop \underline{SAM} with se\underline{m}antic f\underline{e}ature fu\underline{s}ion guidanc\underline{e} (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt it to multi-modal SOD tasks. However, it is difficult for SAM, which is trained on single-modal data, to directly mine the complementary benefits of multi-modal inputs and use them comprehensively for accurate saliency prediction. To address these issues, we first design a multi-modal complementary fusion module that extracts robust multi-modal semantic features by integrating information from visible and thermal or depth image pairs. We then feed the extracted multi-modal semantic features into both the SAM image encoder and the mask decoder, for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks show the effectiveness of the proposed framework. The code will be available at \url{https://github.com/Angknpng/Sammese}.
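
As a rough illustration of the adapter idea described in the abstract, the sketch below shows one common way to inject fused multi-modal features into a frozen single-modal token stream: a small bottleneck adapter with a residual connection. This is not the authors' implementation. The gated fusion rule, the names `fuse` and `BottleneckAdapter`, the concatenation of token and fused features, and the zero-initialized up-projection are all assumptions chosen for illustration.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, a) for a in v]


def fuse(rgb_feat, aux_feat):
    """Hypothetical complementary fusion of two modality feature vectors.

    A per-element sigmoid gate mixes the RGB feature with the auxiliary
    (depth or thermal) feature. This rule is an assumption, not the
    fusion module described in the paper.
    """
    fused = []
    for r, a in zip(rgb_feat, aux_feat):
        gate = 1.0 / (1.0 + math.exp(-(r * a)))
        fused.append(gate * r + (1.0 - gate) * a)
    return fused


class BottleneckAdapter:
    """Hypothetical adapter: down-project, ReLU, up-project, add residual."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Down-projection sees the token concatenated with the fused feature.
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)]
                     for _ in range(hidden)]
        # Zero-initialized up-projection (an assumption): the adapter starts
        # as an identity map, so the frozen encoder's output is unchanged
        # before any fine-tuning.
        self.up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, token, fused):
        h = relu(matvec(self.down, token + fused))  # list concat, then project
        delta = matvec(self.up, h)
        return [t + d for t, d in zip(token, delta)]


rgb = [0.5, -1.0, 2.0, 0.0]
depth = [1.5, 0.5, -2.0, 1.0]
token = [1.0, 2.0, 3.0, 4.0]  # stand-in for one frozen-encoder token

fused = fuse(rgb, depth)
adapter = BottleneckAdapter(dim=4, hidden=2)
adapted = adapter(token, fused)
```

With the zero-initialized up-projection, `adapted` equals `token` before training; only the adapter weights would be updated during fine-tuning while the pre-trained encoder stays frozen.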

Multimodal salient object detection via adversarial learning with collaborative generator

Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

cmSalGAN: RGB-D Salient Object Detection With Cross-View Generative Adversarial Networks

Salient Object Detection Based on Visual Perceptual Saturation and Two-Stream Hybrid Networks

Unified-modal Salient Object Detection via Adaptive Prompt Learning

Multimodal Adversarially Learned Inference with Factorized Discriminators

Salient Object Detection From Arbitrary Modalities

Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance

MSEDNet: Multi-scale fusion and edge-supervised network for RGB-T salient object detection

RGB-D Salient Object Detection with Cross-Modality Modulation and Selection

Modality-Induced Transfer-Fusion Network for RGB-D and RGB-T Salient Object Detection

Learnable Depth-Sensitive Attention for Deep RGB-D Saliency Detection with Multi-modal Fusion Architecture Search

Multi-interactive Dual-decoder for RGB-thermal Salient Object Detection

Memory-aided Contrastive Consensus Learning for Co-salient Object Detection

Research on Multimodal Image Fusion Target Detection Algorithm Based on Generative Adversarial Network

Cross-Modality Double Bidirectional Interaction and Fusion Network for RGB-T Salient Object Detection

Enabling modality interactions for RGB-T salient object detection

MMNet: Multi-Stage and Multi-Scale Fusion Network for RGB-D Salient Object Detection

Mutual Information Regularization for Weakly-supervised RGB-D Salient Object Detection