MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion

Zhe Li, Haiwei Pan, Kejia Zhang, Yuhua Wang, Fengming Yu
2024-04-12
Abstract: Multi-modality image fusion (MMIF) aims to integrate complementary information from different modalities into a single fused image to represent the imaging scene and facilitate downstream visual tasks comprehensively. In recent years, significant progress has been made in MMIF tasks due to advances in deep neural networks. However, existing methods cannot effectively and efficiently extract modality-specific and modality-fused features constrained by the inherent local reductive bias (CNN) or quadratic computational complexity (Transformers). To overcome this issue, we propose a Mamba-based Dual-phase Fusion (MambaDFuse) model. Firstly, a dual-level feature extractor is designed to capture long-range features from single-modality images by extracting low and high-level features from CNN and Mamba blocks. Then, a dual-phase feature fusion module is proposed to obtain fusion features that combine complementary information from different modalities. It uses the channel exchange method for shallow fusion and the enhanced Multi-modal Mamba (M3) blocks for deep fusion. Finally, the fused image reconstruction module utilizes the inverse transformation of the feature extraction to generate the fused result. Through extensive experiments, our approach achieves promising fusion results in infrared-visible image fusion and medical image fusion. Additionally, in a unified benchmark, MambaDFuse has also demonstrated improved performance in downstream tasks such as object detection. Code with checkpoints will be available after the peer-review process.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses multi-modality image fusion (MMIF), which integrates complementary information from images of different modalities into a single high-quality fused image. Existing methods struggle to extract modality-specific and modality-fused features both effectively and efficiently: convolutional neural networks (CNNs) are constrained by their local inductive bias, and Transformers by their quadratic computational complexity. To address this, the paper proposes a Mamba-based dual-phase fusion model (MambaDFuse) with three stages:
1. Dual-level feature extraction: CNN and Mamba blocks are combined to extract low-level and high-level features. The CNN layers capture local features for early visual processing, and the Mamba blocks extract long-range features.
2. Dual-phase feature fusion: the shallow fusion module uses channel exchange to fuse global overview features, while the deep fusion module uses enhanced Multi-modal Mamba (M3) blocks for cross-modal deep feature fusion to obtain local detail features.
3. Fused image reconstruction: the inverse transformation of the feature extraction generates the fused result.
Experiments show promising results for MambaDFuse on infrared-visible image fusion and medical image fusion, as well as improved performance on downstream tasks such as object detection. Compared with existing methods, MambaDFuse is reported to improve both efficiency and effectiveness. The paper's main contributions are therefore the first application of Mamba to the MMIF task and an effective feature extraction and fusion mechanism, which together offer a new solution for multi-modality image fusion.
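The channel exchange idea behind the shallow fusion step can be illustrated with a small sketch. This is not the authors' implementation: the summary does not state the exchange rule, so the code assumes a fixed, parameter-free pattern that swaps every `period`-th channel between the two modality feature stacks. The function name `channel_exchange`, the `period` parameter, and the toy inputs are all hypothetical.

```python
from typing import List, Tuple

Channel = List[List[float]]   # one H x W feature map
Features = List[Channel]      # C channels for a single modality


def channel_exchange(
    feat_a: Features, feat_b: Features, period: int = 2
) -> Tuple[Features, Features]:
    """Swap every `period`-th channel between two modality feature stacks.

    Assumption: a fixed, parameter-free swap pattern. The exact exchange
    rule used by MambaDFuse is not given in the summary above.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("both modalities must have the same number of channels")
    if period < 1:
        raise ValueError("period must be a positive integer")

    out_a: Features = []
    out_b: Features = []
    for idx, (ch_a, ch_b) in enumerate(zip(feat_a, feat_b)):
        if idx % period == 0:
            # Exchanged channel: each stream receives the other modality's map.
            out_a.append(ch_b)
            out_b.append(ch_a)
        else:
            # Untouched channel: each stream keeps its own map.
            out_a.append(ch_a)
            out_b.append(ch_b)
    return out_a, out_b


# Toy input: 4 channels of 1x1 maps, infrared filled with 1.0, visible with 2.0.
infrared = [[[1.0]] for _ in range(4)]
visible = [[[2.0]] for _ in range(4)]
mixed_ir, mixed_vis = channel_exchange(infrared, visible)
```

After the call, each stream holds channels from both modalities in alternation, so later layers see cross-modal information without any extra learned parameters. Applying the same exchange a second time restores the original stacks, which makes the operation easy to check.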