Abstract: Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness by training models from scratch, the limited amount of multi-modal data prevents these methods from reaching their full potential. In this paper, we propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Although SAM is a recent vision foundation model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop \underline{SAM} with se\underline{m}antic f\underline{e}ature fu\underline{s}ion guidanc\underline{e} (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt it to multi-modal SOD tasks. However, it is difficult for SAM, which is trained on single-modal data, to directly mine the complementary benefits of multi-modal inputs and comprehensively utilize them to achieve accurate saliency prediction. To address these issues, we first design a multi-modal complementary fusion module to extract robust multi-modal semantic features by integrating information from visible and thermal or depth image pairs. Then, we feed the extracted multi-modal semantic features into both the SAM image encoder and mask decoder for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks show the effectiveness of the proposed framework. The code will be available at \url{https://github.com/Angknpng/Sammese}.

Evaluating SAM2's Role in Camouflaged Object Detection: From SAM to SAM2

Can SAM Segment Anything? When SAM Meets Camouflaged Object Detection

When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

SAM Fails to Segment Anything? – SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More

From SAM to SAM 2: Exploring Improvements in Meta's Segment Anything Model

SAM-Adapter: Adapting Segment Anything in Underperformed Scenes

Evaluation Study on SAM 2 for Class-agnostic Instance-level Segmentation

When SAM2 Meets Video Shadow and Mirror Detection

Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

Exploring Deeper! Segment Anything Model with Depth Perception for Camouflaged Object Detection

SAM2-Adapter: Evaluating & Adapting Segment Anything 2 in Downstream Tasks: Camouflage, Shadow, Medical Image Segmentation, and More

Endow SAM with Keen Eyes: Temporal-spatial Prompt Learning for Video Camouflaged Object Detection

Det-SAM2: Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2

SAMP: Adapting Segment Anything Model for Pose Estimation

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance

SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention

Unleashing the Potential of SAM2 for Biomedical Images and Videos: A Survey

Evaluation of Segment Anything Model 2: The Role of SAM2 in the Underwater Environment

MFS Enhanced SAM: Achieving Superior Performance in Bimodal Few-Shot Segmentation

SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation