Abstract: A modality separate tri‐stream net is proposed that fully explores the potential of the features shared between modalities and the features specific to each modality, and that cleanly separates and extracts both kinds of features, thereby improving the utilisation of multi‐modal data. Most existing RGB‐T salient object detection methods are based on a dual‐stream encoding, single‐stream decoding network architecture. These models depend on the quality of the fused features, which often emphasise modality‐shared features and overlook modality‐specific ones, and thus fail to fully utilise the rich information contained in multi‐modality data. To this end, a modality separate tri‐stream net (MSTNet) is proposed, which consists of a tri‐stream encoding (TSE) structure and a tri‐stream decoding (TSD) structure. The TSE explicitly separates and extracts the modality‐shared and modality‐specific features to improve the utilisation of multi‐modality data. In addition, based on hybrid‐attention and cross‐attention mechanisms, the authors design an enhanced complementary fusion (ECF) module, which fully considers the complementarity between the features to be fused and achieves high‐quality feature fusion. Furthermore, in the TSD, the quality of the uni‐modality features is ensured under the constraint of supervision. Finally, to make full use of the rich multi‐level and multi‐scale decoding features contained in the TSD, the authors design an adaptive multi‐scale decoding module and a multi‐stream feature aggregation module to improve the decoding capability. Extensive experiments on three public datasets show that MSTNet outperforms 14 state‐of‐the‐art methods, demonstrating that it extracts and utilises multi‐modality information more fully and obtains more complete and richer features, thus improving the model's performance. The code will be released at https://github.com/JOOOOKII/MSTNet.

Leveraging modality‐specific and shared features for RGB‐T salient object detection
