Abstract: Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness by training models from scratch, the limited amount of multi-modal data prevents them from reaching optimal performance. In this paper, we propose a novel framework that explores and exploits the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Although SAM is a recent vision foundation model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop \underline{SAM} with se\underline{m}antic f\underline{e}ature fu\underline{s}ion guidanc\underline{e} (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt it to multi-modal SOD tasks. However, it is difficult for SAM, which is trained on single-modal data, to directly mine the complementary benefits of multi-modal inputs and use them comprehensively for accurate saliency prediction. To address these issues, we first design a multi-modal complementary fusion module that extracts robust multi-modal semantic features by integrating information from visible-thermal or visible-depth image pairs. We then feed the extracted multi-modal semantic features into both the SAM image encoder and the mask decoder, for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks show the effectiveness of the proposed framework. The code will be available at \url{https://github.com/Angknpng/Sammese}.
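
To make the encoder-side idea concrete, the following is a minimal NumPy sketch of how fused multi-modal features could be injected into a frozen, single-modal encoder block through a lightweight residual adapter. It is an illustration under stated assumptions, not the Sammese implementation: the names `fuse` and `MultiModalAdapter`, the gated-sum fusion, the bottleneck layout, and the zero initialization of the up-projection are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def fuse(rgb_feat, aux_feat):
    """Hypothetical stand-in for a complementary fusion module.

    rgb_feat, aux_feat: (num_tokens, dim) features from the visible image
    and from the thermal or depth image. A per-token sigmoid gate mixes them.
    """
    agreement = (rgb_feat * aux_feat).mean(axis=-1, keepdims=True)
    gate = 1.0 / (1.0 + np.exp(-agreement))
    return gate * rgb_feat + (1.0 - gate) * aux_feat


class MultiModalAdapter:
    """Hypothetical bottleneck adapter (an assumption, not the paper's design).

    It adds a small trainable residual branch, conditioned on fused
    multi-modal features, on top of the output of a frozen encoder block.
    """

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Zero-initialized up-projection: the adapter is an identity map at
        # the start, so the pre-trained behaviour is preserved initially.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, tokens, fused):
        # tokens: (num_tokens, dim) output of a frozen encoder block
        # fused:  (num_tokens, dim) multi-modal semantic features
        hidden = gelu((tokens + fused) @ self.w_down)
        return tokens + hidden @ self.w_up


rng = np.random.default_rng(0)
rgb = rng.normal(size=(16, 32))      # toy visible-image tokens
aux = rng.normal(size=(16, 32))      # toy thermal/depth tokens
tokens = rng.normal(size=(16, 32))   # toy frozen-encoder tokens

adapter = MultiModalAdapter(dim=32, bottleneck=8, rng=rng)
adapted = adapter(tokens, fuse(rgb, aux))
```

With the zero-initialized up-projection, `adapted` equals `tokens` before any training. In this sketch only `w_down` and `w_up` would be trainable, which is one common way to adapt a large pre-trained model when task data is limited.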

SAM: Modeling Scene, Object and Action with Semantics Attention Modules for Video Recognition

Video Action Recognition with Attentive Semantic Units

Scene adaptive mechanism for action recognition

Harnessing Object and Scene Semantics for Large-Scale Video Understanding

Dense Semantics-Assisted Networks For Video Action Recognition

FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

Scene-Aware Feature Matching

Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance

There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks

Semantic Alignment Network for Multi-modal Emotion Recognition

RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

Segment Anything with Multiple Modalities

A Study of Actor and Action Semantic Retention in Video Supervoxel Segmentation

Semantic-aware Video Representation for Few-shot Action Recognition

Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

Segment Anything for Videos: A Systematic Survey

A Synergistical Attention Model for Semantic Segmentation of Remote Sensing Images

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

Learning Semantic Feature Map for Visual Content Recognition