Abstract: The use of complementary information, namely depth or thermal information, has proven beneficial for salient object detection (SOD) in recent years. However, the RGB-D and RGB-T SOD problems are currently solved independently, and most methods directly extract and fuse raw features from backbones. Such methods can be easily restricted by low-quality modality data and redundant cross-modal features. In this work, a unified end-to-end framework is designed to handle RGB-D and RGB-T SOD tasks simultaneously. Specifically, to effectively tackle multi-modal features, we propose a novel multi-stage and multi-scale fusion network (MMNet), which consists of a cross-modal multi-stage fusion module (CMFM) and a bi-directional multi-scale decoder (BMD). Similar to the visual color stage doctrine in the human visual system (HVS), the proposed CMFM explores important feature representations in the feature response stage and integrates them into cross-modal features in the adversarial combination stage. Moreover, the proposed BMD learns the combination of multi-level cross-modal fused features to capture both local and global information of salient objects, further boosting multi-modal SOD performance. The proposed unified cross-modality feature analysis framework, based on two-stage and multi-scale information fusion, can be used for diverse multi-modal SOD tasks. Comprehensive experiments (~92K image pairs) demonstrate that the proposed method consistently outperforms 21 other state-of-the-art methods on nine benchmark datasets. This validates that the proposed method works well on diverse multi-modal SOD tasks with good generalization and robustness, and provides a good multi-modal SOD benchmark.
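
As a rough illustration of the two-stage fusion idea described in the abstract, the sketch below first reweights the channels of each modality and then combines the two modalities. It is not the authors' CMFM: the sigmoid channel gate, the sum-plus-product combination, and the names `channel_gate` and `fuse` are all assumptions chosen for simplicity, and plain Python lists stand in for feature tensors.

```python
import math

def channel_gate(feature):
    """Stage 1 (assumed): scale each channel by a sigmoid of its global average."""
    gated = []
    for channel in feature:
        mean = sum(channel) / len(channel)
        weight = 1.0 / (1.0 + math.exp(-mean))
        gated.append([weight * v for v in channel])
    return gated

def fuse(rgb, aux):
    """Stage 2 (assumed): combine the gated modalities by element-wise sum plus product."""
    rgb_g, aux_g = channel_gate(rgb), channel_gate(aux)
    return [
        [r + a + r * a for r, a in zip(rgb_ch, aux_ch)]
        for rgb_ch, aux_ch in zip(rgb_g, aux_g)
    ]

# Toy features: 2 channels, 2 spatial positions each (hypothetical values).
rgb = [[1.0, 2.0], [0.0, 0.0]]   # RGB branch responds on channel 0
aux = [[0.0, 0.0], [3.0, 1.0]]   # depth/thermal branch responds on channel 1
fused = fuse(rgb, aux)
```

In this toy case each fused channel is carried by whichever modality responds on it, which is the kind of behavior a fusion module would need when one modality is uninformative. The same `fuse` call applies whether `aux` holds depth or thermal features.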

Cross-modal multi-scale feature fusion-based RGB-T saliency object detection method

Multi-Frame Image Fusion Method Combining Spatial-Temporal Saliency Detection and NSCT

RGB-T Salient Object Detection Via Fusing Multi-level CNN Features

Revisiting Feature Fusion for RGB-T Salient Object Detection

Cross-Modality Double Bidirectional Interaction and Fusion Network for RGB-T Salient Object Detection

CFRNet: Cross-Attention-Based Fusion and Refinement Network for Enhanced RGB-T Salient Object Detection

CGFNet: Cross-Guided Fusion Network for RGB-T Salient Object Detection

RGB-D Salient Object Detection Based on Cross-Modal and Cross-Level Feature Fusion

RGB-D salient object detection via cross-modal joint feature extraction and low-bound fusion loss

RGB-T salient object detection via CNN feature and result saliency map fusion

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

Lightweight Cross-Modal Information Mutual Reinforcement Network for RGB-T Salient Object Detection

Employing Bilinear Fusion and Saliency Prior Information for RGB-D Salient Object Detection

Global Guided Cross-Modal Cross-Scale Network for RGB-D Salient Object Detection

Modality-Induced Transfer-Fusion Network for RGB-D and RGB-T Salient Object Detection

Interactive Context-Aware Network for RGB-T Salient Object Detection

CAFCNet: Cross-modality asymmetric feature complement network for RGB-T salient object detection

Discriminative feature fusion for RGB-D salient object detection

Enabling modality interactions for RGB-T salient object detection

Coordinate Attention Filtering Depth-Feature Guide Cross-Modal Fusion RGB-Depth Salient Object Detection

MMNet: Multi-Stage and Multi-Scale Fusion Network for RGB-D Salient Object Detection