MambaSOD: Dual Mamba-Driven Cross-Modal Fusion Network for RGB-D Salient Object Detection

Yue Zhan, Zhihong Zeng, Haijun Liu, Xiaoheng Tan, Yinli Tian
2024-10-19
Abstract: The purpose of RGB-D Salient Object Detection (SOD) is to pinpoint the most visually conspicuous areas within images accurately. While conventional deep models heavily rely on CNN extractors and overlook the long-range contextual dependencies, subsequent transformer-based models have addressed the issue to some extent but introduce high computational complexity. Moreover, incorporating spatial information from depth maps has been proven effective for this task. A primary challenge of this issue is how to fuse the complementary information from RGB and depth effectively. In this paper, we propose a dual Mamba-driven cross-modal fusion network for RGB-D SOD, named MambaSOD. Specifically, we first employ a dual Mamba-driven feature extractor for both RGB and depth to model the long-range dependencies in multiple modality inputs with linear complexity. Then, we design a cross-modal fusion Mamba for the captured multi-modal features to fully utilize the complementary information between the RGB and depth features. To the best of our knowledge, this work is the first attempt to explore the potential of the Mamba in the RGB-D SOD task, offering a novel perspective. Numerous experiments conducted on six prevailing datasets demonstrate our method's superiority over sixteen state-of-the-art RGB-D SOD models. The source code will be released at https://github.com/YueZhan721/MambaSOD.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses three key problems in RGB-D salient object detection (RGB-D SOD):

1. **Long-range dependency modeling**:
   - Conventional deep-learning models rely mainly on convolutional neural network (CNN) extractors, which capture long-range contextual dependencies poorly.
   - Transformer-based models address this to some extent, but they introduce high computational complexity.
2. **Multi-modal feature fusion**:
   - Effectively fusing the complementary information in RGB images and depth maps is a major challenge. Existing methods usually adopt early-, mid-, or late-fusion strategies, which do not adequately model the long-range dependencies of cross-modal correlations.
3. **Computational efficiency**:
   - Transformer-based methods can capture global information, but the cost of self-attention grows quadratically with the number of tokens, which makes them computationally expensive and difficult to deploy on resource-limited devices.

### Solutions

To solve these problems, the authors propose a dual Mamba-driven cross-modal fusion network (MambaSOD), with the following contributions:

1. **Dual Mamba-driven feature extractor**:
   - The Mamba architecture serves as the backbone network and extracts features from the RGB image and the depth map separately. It models long-range dependencies in images while maintaining linear complexity.
2. **Cross-modal fusion Mamba module**:
   - A cross-modal fusion Mamba module (CMM) models the long-range dependencies of cross-modal correlations and enhances modality-specific features. Projecting the features of the two modalities into a shared space allows the module to learn complementary features.
3. **Multi-level refinement decoder**:
   - A multi-level refinement module (MR) aggregates the fused RGB-D features to predict accurate saliency maps.

### Experimental results

The authors ran experiments on six widely used RGB-D SOD benchmark datasets. The results show that MambaSOD outperforms 16 existing state-of-the-art RGB-D SOD models on multiple metrics: F-measure (\(F_\beta\)), E-measure (\(E_\xi\)), S-measure (\(S_\alpha\)), and Mean Absolute Error (MAE).

### Formulas

1. **Binary cross-entropy loss function**:
   \[ L_{\text{BCE}}(P, G) = -\left[ G\cdot\log(P) + (1 - G)\cdot\log(1 - P) \right] \]
   where \( P \) is the predicted probability and \( G \) is the ground-truth label. The leading minus sign makes the loss non-negative, so that minimizing it pushes \( P \) toward \( G \).
2. **Total loss function**:
   \[ L_{\text{total}} = \sum_{i = 1}^{5} L_{\text{BCE}}(P_i, G) \]
   where \( P_i \) is the saliency map predicted at the \( i \)-th layer of the decoding stage.

With this architecture and these experimental results, the paper demonstrates the strong performance of MambaSOD on the RGB-D salient object detection task.
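
The two loss formulas above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names `bce_loss` and `total_loss` are hypothetical. The sketch also assumes three things the summary does not state: the per-pixel loss is averaged over all pixels, predictions are clamped by a small `eps` to keep the logarithm finite, and every side output has already been resized to the ground-truth resolution and flattened to a list.

```python
import math


def bce_loss(pred, gt, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels.

    pred, gt: flat lists of equal length (one entry per pixel).
    eps: clamp value (an assumption of this sketch) that keeps log() finite.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of pixels")
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    # Averaging over pixels is an assumption; the summary gives only the per-pixel form.
    return total / len(pred)


def total_loss(side_outputs, gt):
    """Sum the BCE loss over all decoder side outputs (five in the summary)."""
    return sum(bce_loss(p, gt) for p in side_outputs)


# Toy example: a 4-pixel ground truth and five made-up side outputs.
gt = [1.0, 0.0, 1.0, 0.0]
side_outputs = [
    [0.6, 0.4, 0.6, 0.4],
    [0.7, 0.3, 0.7, 0.3],
    [0.8, 0.2, 0.8, 0.2],
    [0.9, 0.1, 0.9, 0.1],
    [0.95, 0.05, 0.95, 0.05],
]
print(total_loss(side_outputs, gt))
```

Supervising every decoder level against the same ground truth, rather than only the final map, gives each refinement stage its own training signal. In a real training loop the same computation would normally be done on tensors with a framework's built-in BCE routine.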