Dual Mutual Learning Network with Global-local Awareness for RGB-D Salient Object Detection

Kang Yi, Haoran Tang, Yumeng Li, Jing Xu, Jun Zhang
2025-01-03
Abstract: RGB-D salient object detection (SOD), aiming to highlight prominent regions of a given scene by jointly modeling RGB and depth information, is one of the challenging pixel-level prediction tasks. Recently, the dual-attention mechanism has been devoted to this area due to its ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manual-mandatory fusion paradigm without considering the inherent discrepancy between the RGB and depth, which may lead to a reduction in performance. Moreover, the long-range dependencies derived from global and local information make it difficult to leverage a unified efficient fusion strategy. Hence, in this paper, we propose the GL-DMNet, a novel dual mutual learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in spatial and channel dimensions. Besides, we adopt an efficient decoder based on cascade transformer-infused reconstruction to integrate multi-level fusion features jointly. Extensive experiments on six benchmark datasets demonstrate that our proposed GL-DMNet performs better than 24 RGB-D SOD methods, achieving an average improvement of ~3% across four evaluation metrics compared to the second-best model (S3Net). Codes and results are available at https://github.com/kingkung2016/GL-DMNet.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses two main problems in RGB-D salient object detection (RGB-D SOD):

1. **Challenges in cross-modal feature fusion**:
   - Existing methods usually fuse RGB and depth features directly, ignoring the inherent differences between the two modalities (such as different semantic content). This direct fusion can degrade performance.
   - The paper proposes a new dual mutual learning network (GL-DMNet), which exploits the dependencies between modalities in the spatial and channel dimensions through a position mutual fusion (PMF) module and a channel mutual fusion (CMF) module.
2. **Insufficient synergy of global-local associations**:
   - In RGB-D SOD, the global and local associations of each pixel are not sufficiently coordinated. In multi-modal learning in particular, RGB and depth features introduce longer-range dependencies, which makes it harder for the two modalities to complement each other.
   - To capture cross-modal global-local context, the paper designs a cascade transformer-infused reconstruction (CTR) decoder that integrates multi-level fused features and strengthens global-local awareness.

### Main contributions

1. **Proposing a new dual mutual learning network (GL-DMNet)**:
   - The network combines Transformer and CNN components to extract features from RGB images and depth inputs.
2. **Designing the position mutual fusion (PMF) module and the channel mutual fusion (CMF) module**:
   - These modules perform cross-modal fusion and explore the global and local dependencies between RGB and depth information.
3. **Developing a cascade transformer-infused reconstruction (CTR) decoder**:
   - The decoder strengthens the global-local awareness of multi-level fused features, improving the accuracy of salient object detection.
4. **Extensive experimental verification**:
   - Evaluations on six public datasets show that GL-DMNet outperforms 24 RGB-D SOD methods on four commonly used evaluation metrics, with an average improvement of about 3% over the second-best model (S3Net).

### Method overview

- **Multi-modal feature encoder**: Uses a ResNet-50 network to extract multi-level features from RGB images and depth maps.
- **Dual mutual learning module**: Fuses RGB and depth features through the PMF and CMF modules to generate attention-weighted cross-modal RGB-D features.
- **Cascade transformer-infused reconstruction decoder**: Splits the Transformer network into four stages, feeds the fused features into each stage independently, and produces the final saliency map through step-by-step decoding and reconstruction.

### Formula representation

- **Position mutual fusion module**:

  \[ f_{\text{RGB}}^i = \text{Conv}_3(\text{Conv}_1(F_{\text{RGB}}^i)) \]
  \[ f_D^i = \text{Conv}_3(\text{Conv}_1(F_D^i)) \]
  \[ f_A^i = f_{\text{RGB}}^i + f_D^i \]
  \[ W_{\text{SP}}^i = \text{Conv}_7(\text{Cat}(\text{MaxPool}(f_A^i), \text{AvgPool}(f_A^i))) \]
  \[ f_{\text{SP}}^i = \text{Conv}_1(f_A^i \odot W_{\text{SP}}^i + f_A^i) \]
  \[ M_{s\text{RGB}}^i = M(f_{\text{SP}}^i \otimes (f_{\text{RGB}}^i)^T) \]
  \[ M_{sD}^i = M(f_{\text{SP}}^i \otimes (f_D^i)^T) \]
  \[ M_{s\text{Fu}}^i = M_{s\text{RGB}}