Abstract: This paper delves into the task of arbitrary modality salient object detection (AM SOD), which aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) is proposed to address two fundamental challenges of AM SOD, i.e., the more diverse modality discrepancies caused by the varying modality types that need to be processed, and the need for a dynamic fusion design caused by the uncertain number of modalities present in the inputs of the multimodal fusion strategy. Specifically, inspired by prompt learning's ability to align the distributions of pre-trained models with the characteristics of downstream tasks by learning some prompts, MAT first presents a modality-adaptive feature extractor (MAFE) that tackles the diverse modality discrepancies by introducing a modality prompt for each modality. In the training stage, a new modality translation contractive (MTC) loss is further designed to assist MAFE in learning modality-distinguishable modality prompts. Accordingly, in the testing stage, MAFE can employ the learned modality prompts to adaptively adjust its feature space according to the characteristics of the input modalities, and can thus extract discriminative unimodal features. Then, MAT presents a channel-wise and spatial-wise fusion hybrid (CSFH) strategy to meet the demand for dynamic fusion. To that end, CSFH dedicates a channel-wise dynamic fusion module (CDFM) and a novel spatial-wise dynamic fusion module (SDFM) to fusing the unimodal features from varying numbers of modalities while effectively capturing cross-modal complementary semantic and detail information, respectively. Moreover, CSFH carefully assigns CDFM and SDFM to different levels of unimodal features based on their characteristics, for more effective exploitation of complementary information.
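
The modality prompt idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the prompt values, the token layout, and the names `PROMPTS`, `add_modality_prompt`, `shared_extractor`, and `extract` are all hypothetical, and a simple mean-mixing function stands in for the shared Transformer. It only shows how one shared extractor can produce modality-dependent features once a per-modality prompt token is prepended to the input tokens.

```python
EMBED_DIM = 4

# Hypothetical per-modality prompt vectors. In a real model these would be
# learnable parameters; fixed values are used here purely for illustration.
PROMPTS = {
    "rgb": [0.1, 0.0, 0.0, 0.0],
    "depth": [0.0, 0.1, 0.0, 0.0],
    "thermal": [0.0, 0.0, 0.1, 0.0],
}


def add_modality_prompt(tokens, modality):
    """Prepend the prompt token of `modality` to a sequence of patch tokens."""
    if modality not in PROMPTS:
        raise KeyError(f"unknown modality: {modality}")
    return [list(PROMPTS[modality])] + [list(t) for t in tokens]


def shared_extractor(tokens):
    """Stand-in for a shared Transformer (an assumption, not the paper's model).

    Each token is mixed with the sequence mean, so the prompt token
    influences every output token.
    """
    n = len(tokens)
    mean = [sum(t[d] for t in tokens) / n for d in range(EMBED_DIM)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(EMBED_DIM)] for t in tokens]


def extract(tokens, modality):
    """Extract unimodal features with the shared weights and a modality prompt."""
    prompted = add_modality_prompt(tokens, modality)
    features = shared_extractor(prompted)
    return features[1:]  # drop the prompt position, keep the patch features


patches = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
rgb_features = extract(patches, "rgb")
depth_features = extract(patches, "depth")
```

The same patch tokens yield different features under the `"rgb"` and `"depth"` prompts, even though `shared_extractor` has no modality-specific weights.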

What problem does this paper attempt to address?

The paper addresses two fundamental challenges in Arbitrary-Modality Salient Object Detection (AM SOD):

1. **Wider modality discrepancies**: AM SOD usually has to handle more modality types than existing Fixed-Modality SOD (FM SOD) models. Because each modality comes from a different imaging mechanism, each has its own characteristics, so the discrepancies between modalities are more diverse. This makes it hard to extract discriminative unimodal features, especially with a limited number of parameters.
2. **Uncertain number of modalities at the fusion stage**: AM SOD models can receive a single image (such as RGB data), two images (such as RGB-D or RGB-T data), or even three images (such as RGB-D-T data). Unlike existing FM SOD models, which only need to fuse a fixed number of unimodal features, AM SOD models must be able to dynamically fuse a varying number of unimodal features.

To address these challenges, the paper proposes a new Modality-Adaptive Transformer (MAT), which consists of two parts:

- **Modality-Adaptive Feature Extractor (MAFE)**: By introducing a modality prompt for each modality, MAFE can adjust its feature space during feature extraction to suit the characteristics of the input modalities. In the training phase, a new Modality Translation Contractive (MTC) loss helps MAFE learn modality-distinguishable prompts, so that it can better adjust the feature space in the testing phase.
- **Channel-wise and Spatial-wise Fusion Hybrid (CSFH) strategy**: To exploit cross-modal complementary semantic and detail information dynamically and effectively, CSFH combines a Channel-wise Dynamic Fusion Module (CDFM) and a Spatial-wise Dynamic Fusion Module (SDFM). Both modules fuse unimodal features from a varying number of modalities, with CDFM capturing complementary semantic information and SDFM capturing complementary detail information. CSFH assigns each module to the levels of unimodal features that match its characteristics, so that the complementary information is used more effectively.

With these components, MAT is designed to handle both the more diverse modality discrepancies and the dynamic fusion problem of AM SOD.
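
The requirement that fusion must work for one, two, or three modalities can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's CDFM: the function names are hypothetical, and the weighting scheme (global average pooling per channel followed by a softmax across modalities) is a common channel-attention pattern swapped in for the module the paper actually defines. Each feature is a list of channels, and each channel is a list of spatial values.

```python
import math


def channel_descriptor(feature):
    """Global average pooling: one scalar per channel."""
    return [sum(channel) / len(channel) for channel in feature]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channelwise(features):
    """Fuse any number (>= 1) of unimodal features, channel by channel.

    For every channel, the modalities compete through a softmax over their
    pooled responses, and the fused channel is the weighted sum. Assumed
    design for illustration only.
    """
    if not features:
        raise ValueError("at least one modality is required")
    num_channels = len(features[0])
    num_positions = len(features[0][0])
    descriptors = [channel_descriptor(f) for f in features]
    fused = []
    for c in range(num_channels):
        # One weight per modality for this channel; weights sum to 1.
        weights = softmax([d[c] for d in descriptors])
        fused.append([
            sum(w * f[c][p] for w, f in zip(weights, features))
            for p in range(num_positions)
        ])
    return fused


# Toy features with 2 channels and 2 spatial positions each.
rgb = [[1.0, 2.0], [3.0, 4.0]]
depth = [[0.0, 0.0], [5.0, 5.0]]
thermal = [[2.0, 2.0], [1.0, 1.0]]

fused_single = fuse_channelwise([rgb])                 # RGB only
fused_pair = fuse_channelwise([rgb, depth])            # RGB-D
fused_triple = fuse_channelwise([rgb, depth, thermal]) # RGB-D-T
```

Because the softmax is taken over however many modalities are present, the same function handles all three input settings without modality-count-specific branches. With a single modality the weight is 1 and the feature passes through unchanged.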

Modality Prompts for Arbitrary Modality Salient Object Detection

Salient Object Detection From Arbitrary Modalities

Modality-Induced Transfer-Fusion Network for RGB-D and RGB-T Salient Object Detection

Unified-modal Salient Object Detection via Adaptive Prompt Learning

Enabling modality interactions for RGB-T salient object detection

Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance

Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

MFFNet: Multi-modal Feature Fusion Network for V-D-T Salient Object Detection

RGB-D Salient Object Detection with Cross-Modality Modulation and Selection

Cross-Modality Double Bidirectional Interaction and Fusion Network for RGB-T Salient Object Detection

Deep Correlated Prompting for Visual Recognition with Missing Modalities

RGBD Salient Object Detection via Disentangled Cross-modal Fusion

Multimodal Prompting with Missing Modalities for Visual Recognition

Cross-modality interaction for few-shot multispectral object detection with semantic knowledge

Visual Prompt Flexible-Modal Face Anti-Spoofing

Modality-Guided Subnetwork for Salient Object Detection

Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond

Semantic feature-guided and correlation-aggregated salient object detection

Multi-interactive Dual-decoder for RGB-thermal Salient Object Detection