Abstract: Most existing RGB-D salient object detection (SOD) models adopt a two-stream structure to extract information from the input RGB and depth images. Because they use two subnetworks for unimodal feature extraction and multiple multi-modal feature fusion modules to extract cross-modal complementary information, these models require a huge number of parameters, which hinders their real-life application. To remedy this, we propose a novel middle-level feature fusion structure that enables the design of a lightweight RGB-D SOD model. Specifically, the proposed structure first employs two shallow subnetworks to extract low- and middle-level unimodal RGB and depth features, respectively. Afterward, instead of integrating unimodal features multiple times at different layers, we fuse the middle-level features only once via a specially designed fusion module. On top of that, an additional subnetwork extracts high-level multi-modal semantic features for the final salient object detection. Because the high-level layers are shared rather than duplicated per modality and only one fusion module is needed, this design greatly reduces the number of network parameters. Moreover, to compensate for the performance loss caused by the parameter reduction, a relation-aware multi-modal feature fusion module is specially designed to effectively capture the cross-modal complementary information during the fusion of the middle-level features. By enabling feature-level and decision-level information to interact, we maximize the use of the fused cross-modal middle-level features and the extracted cross-modal high-level features for saliency prediction. Experimental results on several benchmark datasets verify the effectiveness of the proposed method and its superiority over several state-of-the-art methods. Remarkably, our proposed model has only 3.9M parameters and runs at 33 FPS.
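The parameter argument in the abstract can be illustrated with a rough count. The sketch below is not the paper's architecture: the five-stage layout, the channel widths, the one-convolution-per-stage simplification, the 1x1 fusion convolution, and the assumption that both streams take 3-channel input are all hypothetical choices made only to show why fusing once at the middle level and sharing the high-level layers needs fewer parameters than two full streams with per-stage fusion.

```python
# Illustrative parameter count only. Channel widths, stage layout, and the
# fusion module are assumptions for this sketch, not the paper's design.

def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution."""
    return c_in * c_out * k * k + c_out

# Hypothetical five-stage backbone: (in_channels, out_channels) per stage,
# simplified to a single 3x3 convolution per stage.
STAGES = [(3, 64), (64, 128), (128, 256), (256, 512), (512, 512)]

def backbone_params(stages):
    """Total parameters of the given list of stages."""
    return sum(conv_params(c_in, c_out) for c_in, c_out in stages)

def fusion_params(channels):
    """Hypothetical fusion: 1x1 conv over concatenated RGB and depth features."""
    return conv_params(2 * channels, channels, k=1)

def two_stream_total():
    """Two full backbones with a fusion module after every stage."""
    streams = 2 * backbone_params(STAGES)
    fusions = sum(fusion_params(c_out) for _, c_out in STAGES)
    return streams + fusions

def middle_fusion_total(split=3):
    """Two shallow backbones up to `split`, one fusion, one shared deep part."""
    shallow = 2 * backbone_params(STAGES[:split])
    fusion = fusion_params(STAGES[split - 1][1])
    shared = backbone_params(STAGES[split:])
    return shallow + fusion + shared

if __name__ == "__main__":
    print("two-stream:", two_stream_total())
    print("middle-level fusion:", middle_fusion_total())
```

In this toy setting most of the savings come from not duplicating the widest (high-level) stages, which is where convolutional parameter counts concentrate. The absolute numbers have no relation to the 3.9M figure reported above.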

Middle-Level Feature Fusion for Lightweight RGB-D Salient Object Detection
