Abstract: RGB-D salient object detection (SOD), aiming to highlight prominent regions of a given scene by jointly modeling RGB and depth information, is one of the challenging pixel-level prediction tasks. Recently, the dual-attention mechanism has been applied to this area due to its ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manual-mandatory fusion paradigm without considering the inherent discrepancy between RGB and depth, which may reduce performance. Moreover, the long-range dependencies derived from global and local information make it difficult to leverage a unified, efficient fusion strategy. Hence, in this paper, we propose GL-DMNet, a novel dual mutual learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in the spatial and channel dimensions. In addition, we adopt an efficient decoder based on cascade transformer-infused reconstruction to integrate multi-level fusion features jointly. Extensive experiments on six benchmark datasets demonstrate that our proposed GL-DMNet performs better than 24 RGB-D SOD methods, achieving an average improvement of ~3% across four evaluation metrics compared to the second-best model (S3Net). Code and results are available at https://github.com/kingkung2016/GL-DMNet.

What problem does this paper attempt to address?

This paper addresses two main problems in RGB-D salient object detection (RGB-D SOD):

1. **Challenges in cross-modal feature fusion**:
   - Existing methods usually fuse RGB and depth features directly, ignoring the inherent differences between the two modalities (such as different semantic content). This direct fusion can degrade performance.
   - The paper proposes a new dual mutual learning network (GL-DMNet), which exploits the dependencies between modalities in the spatial and channel dimensions through the position mutual fusion (PMF) module and the channel mutual fusion (CMF) module.
2. **Insufficient synergy of global-local associations**:
   - In the RGB-D SOD task, the global and local associations of each pixel are not sufficiently coordinated. For multi-modal learning in particular, RGB and depth features introduce longer-range dependencies, which makes it hard for the two modalities to complement each other.
   - To capture cross-modal global-local context, the paper designs a cascade transformer-infused reconstruction (CTR) decoder that integrates multi-level fused features and strengthens global-local awareness.

### Main contributions

1. **Proposing a new dual mutual learning network (GL-DMNet)**:
   - The network combines Transformer and CNN components to extract features from RGB images and depth inputs.
2. **Designing the position mutual fusion (PMF) module and the channel mutual fusion (CMF) module**:
   - These modules perform cross-modal fusion and explore the global and local dependencies between RGB and depth information.
3. **Developing a cascade transformer-infused reconstruction (CTR) decoder**:
   - The decoder strengthens the global-local awareness of multi-level fused features, improving the accuracy of salient object detection.
4. **Extensive experimental verification**:
   - Evaluations on six public datasets show that GL-DMNet outperforms 24 RGB-D SOD methods on four commonly used evaluation metrics, with an average performance improvement of about 3%.

### Method overview

- **Multi-modal feature encoder**: A ResNet-50 network extracts multi-level features from RGB images and depth maps.
- **Dual mutual learning module**: The PMF and CMF modules fuse RGB and depth features to generate attention-weighted cross-modal RGB-D features.
- **Cascade transformer-infused reconstruction decoder**: The Transformer network is split into four stages, each of which receives the fused features independently; the final saliency map is produced through step-by-step decoding and reconstruction.

### Formula representation

- **Position mutual fusion module**:

\[ f_{\text{RGB}}^i = \text{Conv}_3(\text{Conv}_1(F_{\text{RGB}}^i)) \]

\[ f_D^i = \text{Conv}_3(\text{Conv}_1(F_D^i)) \]

\[ f_A^i = f_{\text{RGB}}^i + f_D^i \]

\[ W_{\text{SP}}^i = \text{Conv}_7(\text{Cat}(\text{MaxPool}(f_A^i), \text{AvgPool}(f_A^i))) \]

\[ f_{\text{SP}}^i = \text{Conv}_1(f_A^i \odot W_{\text{SP}}^i + f_A^i) \]

\[ M_{s\text{RGB}}^i = M(f_{\text{SP}}^i \otimes (f_{\text{RGB}}^i)^T) \]

\[ M_{sD}^i = M(f_{\text{SP}}^i \otimes (f_D^i)^T) \]

\[ M_{s\text{Fu}}^i = M_{s\text{RGB}} \dots \]
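
The spatial-weighting equations above (\(f_A^i\), \(W_{\text{SP}}^i\), \(f_{\text{SP}}^i\)) can be illustrated with a small pure-Python sketch. It is not the authors' implementation and rests on several assumptions: the pooling is read as channel-wise (as in common spatial-attention designs), the learned \(\text{Conv}_7\) is replaced by a fixed average of the two pooled maps, the outer \(\text{Conv}_1\) is omitted, and all function names are hypothetical.

```python
# Illustrative sketch only, not the GL-DMNet code.
# Tensors are nested lists shaped [channels][height][width].

def add_features(f_rgb, f_d):
    """f_A = f_RGB + f_D (element-wise sum of the two modality features)."""
    return [[[a + b for a, b in zip(row_r, row_d)]
             for row_r, row_d in zip(ch_r, ch_d)]
            for ch_r, ch_d in zip(f_rgb, f_d)]

def channel_pool(f, reduce_fn):
    """Collapse the channel axis at every pixel, giving an [H][W] map.

    Assumption: MaxPool/AvgPool in the equations act along channels.
    """
    height, width = len(f[0]), len(f[0][0])
    return [[reduce_fn([ch[y][x] for ch in f]) for x in range(width)]
            for y in range(height)]

def spatial_weight(f_a):
    """Stand-in for W_SP = Conv_7(Cat(MaxPool(f_A), AvgPool(f_A))).

    Assumption: the learned 7x7 convolution is replaced by a fixed average
    of the two pooled maps, so the sketch has no trainable parameters.
    """
    max_map = channel_pool(f_a, max)
    avg_map = channel_pool(f_a, lambda v: sum(v) / len(v))
    return [[(m + a) / 2 for m, a in zip(row_m, row_a)]
            for row_m, row_a in zip(max_map, avg_map)]

def reweight(f_a, w_sp):
    """f_A * W_SP + f_A, with W_SP broadcast over channels (outer Conv_1 omitted)."""
    return [[[v * w + v for v, w in zip(row, w_row)]
             for row, w_row in zip(ch, w_sp)]
            for ch in f_a]

# Toy 2-channel, 1x2 feature maps for each modality.
f_rgb = [[[1.0, 0.0]], [[3.0, 2.0]]]
f_d = [[[1.0, 2.0]], [[1.0, 0.0]]]

f_a = add_features(f_rgb, f_d)   # fused feature
w_sp = spatial_weight(f_a)       # one weight per pixel
f_sp = reweight(f_a, w_sp)       # residual reweighting
```

The residual term `+ f_A` means a pixel with a small weight still keeps its original response, so the spatial map can emphasize locations without zeroing out the rest of the fused feature.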

Dual Mutual Learning Network with Global-local Awareness for RGB-D Salient Object Detection

Double Cross-Modality Progressively Guided Network for RGB-D Salient Object Detection

Global-prior-guided fusion network for salient object detection

MMNet: Multi-Stage and Multi-Scale Fusion Network for RGB-D Salient Object Detection

Specificity-preserving RGB-D Saliency Detection

SLMSF-Net: A Semantic Localization and Multi-Scale Fusion Network for RGB-D Salient Object Detection

An adaptive guidance fusion network for RGB-D salient object detection

Compensated Attention Feature Fusion and Hierarchical Multiplication Decoder Network for RGB-D Salient Object Detection

Dynamic Selective Network for RGB-D Salient Object Detection

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

Attention-guided cross-modal multiple feature aggregation network for RGB-D salient object detection

HFMDNet: Hierarchical Fusion and Multilevel Decoder Network for RGB-D Salient Object Detection

Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection

Lightweight Multi-modal Representation Learning for RGB Salient Object Detection

Learnable Depth-Sensitive Attention for Deep RGB-D Saliency Detection with Multi-modal Fusion Architecture Search

Modality-Induced Transfer-Fusion Network for RGB-D and RGB-T Salient Object Detection

DMGNet: Depth mask guiding network for RGB-D salient object detection

DPANet: Depth Potentiality-Aware Gated Attention Network for RGB-D Salient Object Detection

M2RNet: Multi-modal and Multi-scale Refined Network for RGB-D Salient Object Detection

MVSalNet: Multi-view Augmentation for RGB-D Salient Object Detection