Abstract: Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness through training models from scratch, the limited multi-modal data hinders these methods from reaching optimality. In this paper, we propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Although SAM is a recent vision foundation model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop SAM with semantic feature fusion guidance (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt it to multi-modal SOD tasks. However, it is difficult for SAM, which is trained on single-modal data, to directly mine the complementary benefits of multi-modal inputs and comprehensively utilize them to achieve accurate saliency prediction. To address these issues, we first design a multi-modal complementary fusion module to extract robust multi-modal semantic features by integrating information from visible and thermal or depth image pairs. Then, we feed the extracted multi-modal semantic features into both the SAM image encoder and mask decoder for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks show the effectiveness of the proposed framework. The code will be available at https://github.com/Angknpng/Sammese.
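The abstract does not specify how the fusion module or the multi-modal adapter is built, so the following is only a minimal sketch of the general idea, under stated assumptions: the two modality features are fused by an element-wise mean, and the fused feature is injected into a frozen encoder token through a residual bottleneck adapter. The names `fuse`, `adapter`, `w_down`, and `w_up` are hypothetical and do not come from the Sammese code.

```python
def linear(x, w):
    """Multiply a vector x (length d_in) by a weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def fuse(rgb, aux):
    """Assumed fusion rule: element-wise mean of RGB and depth/thermal features."""
    return [(r + a) / 2 for r, a in zip(rgb, aux)]


def adapter(token, fused, w_down, w_up):
    """Residual bottleneck adapter (assumed form, not the paper's design).

    Computes token + up(relu(down(token + fused))), so the frozen token is
    only shifted by a small learned correction.
    """
    h = [t + f for t, f in zip(token, fused)]        # inject fused feature
    z = [max(0.0, v) for v in linear(h, w_down)]     # down-project + ReLU
    delta = linear(z, w_up)                          # up-project
    return [t + d for t, d in zip(token, delta)]     # residual connection


# Toy usage: 2-dim features, bottleneck width 1.
token = [1.0, 2.0]                        # stands in for a frozen encoder token
fused = fuse([2.0, 0.0], [0.0, 0.0])      # fused feature is [1.0, 0.0]
out = adapter(token, fused, w_down=[[1.0], [0.5]], w_up=[[1.0, -1.0]])
```

A common choice with adapters of this shape is to initialise the up-projection to zeros, so that the adapted model starts out identical to the frozen one and only the small `w_down` and `w_up` matrices are trained.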

Endow SAM with Keen Eyes: Temporal-spatial Prompt Learning for Video Camouflaged Object Detection

SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention

Exploring Deeper! Segment Anything Model with Depth Perception for Camouflaged Object Detection

SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection

When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

COMPrompter: reconceptualized segment anything model with multiprompt network for camouflaged object detection

Can SAM Segment Anything? When SAM Meets Camouflaged Object Detection

Explicit Motion Handling and Interactive Prompting for Video Camouflaged Object Detection

SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising

Evaluating SAM2's Role in Camouflaged Object Detection: From SAM to SAM2

SAM-Adapter: Adapting Segment Anything in Underperformed Scenes

SAM-SP: Self-Prompting Makes SAM Great Again

AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model

Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

SAM Fails to Segment Anything? – SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More

Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance

Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts

Semantic-Enhanced Point-Box Joint Prompting for Video Object Segmentation

UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection