Abstract: Multi-modal image fusion aims to combine information from different modalities to create a single image with comprehensive information and detailed textures. However, fusion models based on convolutional neural networks are limited in capturing global image features because they rely on local convolution operations. Transformer-based models excel at global feature modeling but face computational challenges stemming from their quadratic complexity. Recently, the Selective Structured State Space Model has shown significant potential for long-range dependency modeling with linear complexity, offering a promising way to address this dilemma. In this paper, we propose FusionMamba, a novel dynamic feature enhancement method for multimodal image fusion with Mamba. Specifically, we devise an improved, efficient Mamba model for image fusion that integrates an efficient visual state space model with dynamic convolution and channel attention. This refined model not only preserves the performance and global modeling capability of Mamba but also reduces channel redundancy while strengthening local feature enhancement. Additionally, we devise a dynamic feature fusion module (DFFM) comprising two dynamic feature enhancement modules (DFEM) and a cross-modality fusion Mamba module (CMFM). The former performs dynamic texture enhancement and dynamic difference perception, whereas the latter enhances correlated features between modalities and suppresses redundant inter-modal information. FusionMamba achieves state-of-the-art (SOTA) performance across various multimodal medical image fusion tasks (CT-MRI, PET-MRI, SPECT-MRI), the infrared and visible image fusion task (IR-VIS), and a multimodal biomedical image fusion dataset (GFP-PC), which demonstrates the generalization ability of our model. The code for FusionMamba is available at
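To make the "linear complexity" claim in the abstract concrete, here is a minimal NumPy sketch of a selective state space scan over a 1-D token sequence. It is not the FusionMamba code: the function name `selective_scan`, the projection matrices `W_B`, `W_C`, `W_dt`, and all shapes are assumptions chosen for illustration, and the simple discretization used here is one common simplification. The point is only that each token updates a fixed-size hidden state once, so the cost grows linearly with sequence length rather than quadratically as in self-attention.

```python
import numpy as np


def selective_scan(x, A, W_B, W_C, W_dt):
    """Toy selective SSM scan (illustrative, not the paper's implementation).

    x:    (L, D) sequence of L tokens with D channels
    A:    (D, N) negative decay parameters (diagonal state matrix per channel)
    W_B:  (D, N) hypothetical projection giving the input-dependent B_t
    W_C:  (D, N) hypothetical projection giving the input-dependent C_t
    W_dt: (D, D) hypothetical projection giving the input-dependent step size
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))   # fixed-size hidden state, independent of L
    y = np.empty((L, D))
    for t in range(L):     # one pass over the sequence: O(L) state updates
        dt = np.logaddexp(0.0, x[t] @ W_dt)   # softplus keeps the step size positive, shape (D,)
        B_t = x[t] @ W_B                      # input-dependent input matrix, shape (N,)
        C_t = x[t] @ W_C                      # input-dependent output matrix, shape (N,)
        A_bar = np.exp(dt[:, None] * A)       # discretized decay in (0, 1), shape (D, N)
        B_bar = dt[:, None] * B_t[None, :]    # simplified discretized input matrix, shape (D, N)
        h = A_bar * h + B_bar * x[t][:, None] # recurrent state update
        y[t] = h @ C_t                        # read-out, shape (D,)
    return y
```

Because `B_t`, `C_t`, and `dt` depend on the current token, the recurrence can decide per token how much to retain or forget, which is the "selective" part. Practical implementations replace the Python loop with a parallel scan on the GPU, and vision variants scan the flattened image patches along several directions.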

What problem does this paper attempt to address?

The paper aims to address key issues in multimodal image fusion, particularly in the field of medical image fusion. Specifically, existing methods based on Convolutional Neural Networks (CNNs) and Transformer architectures have limitations in capturing global features and local details. The paper proposes a new dynamic feature enhancement model, FusionMamba, to address these issues by leveraging the advantages of the Mamba model.

### Main Issues

1. **Limitations of Existing Methods**:
   - Methods based on CNNs are limited in capturing global features because they rely on local convolution operations.
   - Methods based on Transformers, while good at global feature modeling, have high computational complexity and are not as effective as CNNs in capturing local details.
2. **Insufficient Feature Fusion**:
   - Current fusion methods fail to effectively extract features from different modalities, leading to decreased fusion performance.

### Solution

The paper proposes the FusionMamba model, which aims to address the above issues through the following components:

1. **Dynamic Feature Enhancement Module (DFEM)**: This module dynamically enhances texture detail in the source images and perceives differences between modalities.
2. **Cross-Modal Fusion Mamba Module (CMFM)**: This module mines correlated features between modalities and suppresses redundant inter-modal information.
3. **Dynamic Visual State Space Module (DVSS)**: This module improves the standard Mamba model by enhancing local feature extraction and reducing channel redundancy.

With these improvements, FusionMamba achieves better performance in various multimodal image fusion tasks, including infrared and visible image fusion, CT and MRI image fusion, PET and MRI image fusion, and biomedical image fusion. Experimental results show that FusionMamba outperforms existing state-of-the-art techniques across multiple evaluation metrics.
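The ideas of channel attention and difference perception mentioned above can be illustrated with a small NumPy sketch. This is an assumption-laden illustration, not the authors' DFEM or DVSS: `channel_attention` follows the generic squeeze-and-excitation pattern, `difference_enhance` is a hypothetical helper that gates the difference between two modality feature maps, and the weight matrices `W1` and `W2` stand in for learned parameters.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(feat, W1, W2):
    """Squeeze-and-excitation style channel reweighting.

    feat: (C, H, W) feature map
    W1:   (C, C // r) and W2: (C // r, C), stand-ins for learned weights
    Returns the reweighted feature map and the per-channel weights.
    """
    squeezed = feat.mean(axis=(1, 2))        # global average pool, shape (C,)
    hidden = np.maximum(squeezed @ W1, 0.0)  # ReLU bottleneck, shape (C // r,)
    weights = sigmoid(hidden @ W2)           # per-channel gate in (0, 1), shape (C,)
    return feat * weights[:, None, None], weights


def difference_enhance(feat_a, feat_b, W1, W2):
    """Hypothetical difference-perception step (illustrative only).

    Gates the element-wise difference between two modality feature maps
    with channel attention and adds it back to the first modality.
    """
    diff = feat_a - feat_b
    gated_diff, _ = channel_attention(diff, W1, W2)
    return feat_a + gated_diff
```

In this sketch, channels whose pooled response passes the gate are emphasized and the rest are damped, which is one simple way to reduce channel redundancy. When the two inputs are identical the difference is zero, so `difference_enhance` returns the first feature map unchanged.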

FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba

Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion

MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion

FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model

Fusion-Mamba for Cross-modality Object Detection

A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion

MSFMamba: Multi-Scale Feature Fusion State Space Model for Multi-Source Remote Sensing Image Classification

MDC-RHT: Multi-Modal Medical Image Fusion via Multi-Dimensional Dynamic Convolution and Residual Hybrid Transformer

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model

MEFusion: Unsupervised Mutual Enhancement for Multimodal Image Fusion

MMR-Mamba: Multi-Modal MRI Reconstruction with Mamba and Spatial-Frequency Information Fusion

Mutual-Guided Dynamic Network for Image Fusion

CMEFusion: Cross-Modal Enhancement and Fusion of FIR and Visible Images

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion

MACTFusion: Lightweight Cross Transformer for Adaptive Multimodal Medical Image Fusion

EMOST: A dual-branch hybrid network for medical image fusion via efficient model module and sparse transformer

MedMamba: Vision Mamba for Medical Image Classification

SIMFusion: A Semantic Information-Guided Modality-Specific Fusion Network for MR Images