Abstract: Multi-modal image fusion (MMIF) enhances the information content of the fused image by combining the unique and common features obtained from images captured by sensors of different modalities, improving visualization, object detection, and many other tasks. In this work, we introduce an interpretable network for the MMIF task, named FNet, based on an ℓ₀-regularized multi-modal convolutional sparse coding (MCSC) model. Specifically, to solve the ℓ₀-regularized CSC problem, we develop an algorithm unrolling-based ℓ₀-regularized sparse coding (LZSC) block. Given source images of different modalities, FNet first separates their unique and common features using the LZSC block and then combines these features to generate the final fused image. Additionally, we propose an ℓ₀-regularized MCSC model for the inverse fusion process. Based on this model, we introduce an interpretable inverse fusion network named IFNet, which is used during FNet's training. Extensive experiments show that FNet achieves high-quality fusion results across five different MMIF tasks. Furthermore, we show that FNet enhances downstream object detection in visible-thermal image pairs. We have also visualized the intermediate results of FNet, which demonstrate the good interpretability of our network.

What problem does this paper attempt to address?

This paper addresses several key problems in multi-modal image fusion (MMIF), with the aim of improving both the quality of fused images and the performance of downstream tasks. Specifically:

1. **Improve the quality of fused images**: By combining the unique and common features of images from sensors of different modalities, the method generates fused images that carry more information, which benefits tasks such as visualization and object detection.
2. **Introduce ℓ₀-regularized sparse coding**: To address the limitations of existing methods in estimating sparse features, the paper proposes a multi-modal convolutional sparse coding (MCSC) model based on ℓ₀ regularization. Compared with traditional ℓ₁ regularization, ℓ₀ regularization can estimate sparse features more accurately and avoids over-penalizing features with large absolute values.
3. **Develop an interpretable network architecture**: To improve interpretability, the authors design a network named FNet, which is based on the proposed ℓ₀-regularized sparse coding model and introduces a new LZSC block to solve the ℓ₀-regularized CSC problem. In addition, an inverse fusion network, IFNet, is proposed. During training, it constrains the source images decomposed from the fused image to be similar to the original source images, which further improves the quality of the fused images.
4. **Solve the unsupervised training problem**: Because fused images have no ground-truth labels, the paper proposes a two-stage training method. In the first stage, FNet and IFNet are trained jointly so that the source images produced by inverse fusion are similar to the original source images. In the second stage, only FNet is trained, and the model is optimized by maximizing the similarity between the fused images and the source images.
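The claim in point 2, that ℓ₀ regularization avoids over-penalizing large features, can be illustrated with the two standard proximal operators. The following is a minimal NumPy sketch, not code from the paper: the function names, the test vector, and the value of `lam` are assumptions chosen for illustration. Soft thresholding (ℓ₁) shrinks every surviving entry by `lam`, while hard thresholding (ℓ₀) leaves surviving entries unchanged.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_1.

    Every entry is moved toward zero by lam, so large entries are biased.
    """
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold(x, lam):
    """Proximal operator of lam * ||x||_0 (paired with a 0.5 * ||x - z||^2 term).

    Entries with |x| <= sqrt(2 * lam) are set to zero; the rest are kept as-is.
    """
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

# Hypothetical feature values, chosen only to show the difference.
x = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])
lam = 0.5

l1_result = soft_threshold(x, lam)  # large entries shrink: 4.0 becomes 3.5
l0_result = hard_threshold(x, lam)  # large entries survive unchanged: 4.0 stays 4.0

print("input:", x)
print("l1   :", l1_result)
print("l0   :", l0_result)
```

Both operators zero out the small entries, but only the ℓ₁ version changes the magnitude of the entries it keeps, which is the bias the summary refers to.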
Through these improvements, FNet achieves leading results in multiple multi-modal image fusion tasks and also performs better in downstream tasks such as object detection on visible-thermal image pairs.

### Summary of main contributions:

1. **Developed the first learnable LZSC block**: Used to solve the ℓ₀-regularized CSC problem.
2. **Proposed the MCSC model based on ℓ₀ regularization**: Used to represent the multi-modal image fusion process.
3. **Designed the inverse fusion network IFNet**: Used during training to improve the quality of fused images.
4. **Achieved leading results in five MMIF tasks**: Visible-infrared (VIS-IR), visible-near-infrared (VIS-NIR), CT-MRI, PET-MRI, and SPECT-MRI image fusion.

Together, these contributions give FNet high fused-image quality and strong application potential in downstream tasks.
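To give a rough sense of what an algorithm-unrolling-based block starts from, the sketch below runs plain iterative hard thresholding (IHT) on an ℓ₀-regularized sparse coding problem with a dense dictionary. This is not the authors' LZSC block: the function `iht`, the dense matrix `D`, and all sizes and parameter values are assumptions for illustration. In an unrolled network, each loop iteration would typically become one layer, with quantities such as the step size, threshold, and (convolutional) dictionary learned from data rather than fixed.

```python
import numpy as np

def iht(y, D, lam, n_iter=50):
    """Iterative hard thresholding for
        min_z 0.5 * ||y - D z||_2^2 + lam * ||z||_0.

    Each iteration is a gradient step on the data term followed by
    hard thresholding (the proximal step for the l0 penalty).
    """
    # Step size 1/L, where L is the squared spectral norm of D.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    # Threshold of the l0 proximal operator for penalty weight step * lam.
    thresh = np.sqrt(2.0 * lam * step)
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z - step * (D.T @ (D @ z - y))  # gradient step
        z[np.abs(z) <= thresh] = 0.0        # hard threshold
    return z

# Hypothetical toy problem: a 3-sparse code under a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms
z_true = np.zeros(50)
z_true[[3, 17, 41]] = [2.0, -1.5, 3.0]
y = D @ z_true

z_hat = iht(y, D, lam=0.05, n_iter=200)
print("non-zeros in estimate:", np.count_nonzero(z_hat))
print("reconstruction error :", np.linalg.norm(y - D @ z_hat))
```

IHT is only guaranteed to reach a local solution of this non-convex problem, so the recovered support on random data may differ from `z_true`.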

l0-Regularized Sparse Coding-based Interpretable Network for Multi-Modal Image Fusion

Deep Convolutional Sparse Coding Networks for Interpretable Image Fusion

BCMFIFuse: A Bilateral Cross-Modal Feature Interaction-Based Network for Infrared and Visible Image Fusion

MSAIF-Net: A Multistage Spatial Attention-Based Invertible Fusion Network for MR Images

Frequency Integration and Spatial Compensation Network for Infrared and Visible Image Fusion

A General Spatial-Frequency Learning Framework for Multimodal Image Fusion

Learning a Graph Neural Network with Cross Modality Interaction for Image Fusion

MSFNet: MultiStage Fusion Network for infrared and visible image fusion

LeGFusion: Locally Enhanced Global Learning for Multimodal Image Fusion

CMFA_Net: A cross-modal feature aggregation network for infrared-visible image fusion

SADFusion: A multi-scale infrared and visible image fusion method based on salient-aware and domain-specific

MIFFuse: A Multi-Level Feature Fusion Network for Infrared and Visible Images

LeGFusion: Locally-enhanced Global Learning for Multi-Modal Image Fusion

Correlation-Guided Discriminative Cross-Modality Features Network for Infrared and Visible Image Fusion

MM-Net: A MixFormer-Based Multi-Scale Network for Anatomical and Functional Image Fusion

CoCoNet: Coupled Contrastive Learning Network with Multi-level Feature Ensemble for Multi-modality Image Fusion

Integrating Parallel Attention Mechanisms and Multi-Scale Features for Infrared and Visible Image Fusion

IGNFusion: An Unsupervised Information Gate Network for Multimodal Medical Image Fusion

DCTNet: A Heterogeneous Dual-Branch Multi-Cascade Network for Infrared and Visible Image Fusion

IR-MSDNet: Infrared and Visible Image Fusion Based On Infrared Features and Multiscale Dense Network

FDNet: An end-to-end fusion decomposition network for infrared and visible images