Abstract: Multimodal image fusion (MMIF) provides more comprehensive scene characteristics by synthesizing a single image from multi-sensor images of the same scene, which overcomes the limitations of single-type hardware. To handle MMIF tasks, current deep learning (DL)-based methods usually use convolutional neural networks (CNNs), alone or combined with transformers, to extract local and global contextual information from the source images. However, none of the existing works fully explores contextual information both across modalities and within single modalities, which limits fusion quality. To this end, we propose a new MMIF method based on locally enhanced global learning, termed LeGFusion. Specifically, the network of LeGFusion is built on the locally enhanced transformer block (LETB), which captures long-range dependencies through nonoverlapping window-based self-attention while capturing useful local context by introducing the convolution operator into the transformer. On the one hand, several LETBs are deployed to extract global context from each modality while emphasizing its local information. On the other hand, a fusion module that also consists of LETBs is designed to integrate multimodal features by perceiving cross-modal local and global interactions. Powered by this exploration of intramodal and intermodal contextual information, the proposed LeGFusion is highly capable of capturing significant complementary information for image fusion. Extensive experiments are conducted on two types of MMIF tasks: infrared-visible image fusion (IVF) and medical image fusion. The qualitative and quantitative evaluation results demonstrate the superiority of LeGFusion over state-of-the-art methods. Furthermore, we validate the generalization ability of LeGFusion without fine-tuning, where it still achieves strong results.
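
The abstract describes the LETB only at a high level: self-attention restricted to nonoverlapping windows, plus a convolution that supplies local context. The NumPy sketch below illustrates that general idea and is not the authors' implementation. The function names, the single attention head, the 3x3 depthwise kernel, and the residual sum that combines the two branches are all assumptions made for illustration; the abstract does not specify where the convolution sits inside the block.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into nonoverlapping win x win windows.
    Returns an array of shape (num_windows, win * win, C)."""
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "H and W must be multiples of win"
    x = x.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)          # group the two window indices first
    return x.reshape(-1, win * win, C)

def window_merge(windows, win, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // win, W // win, win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, win, wq, wk, wv):
    """Single-head self-attention computed independently inside each window.
    wq, wk, wv are (C, C) projection matrices (hypothetical parameters)."""
    H, W, C = x.shape
    tokens = window_partition(x, win)                       # (N, T, C)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))   # (N, T, T)
    return window_merge(attn @ v, win, H, W)

def depthwise_conv3x3(x, kernel):
    """Per-channel 3x3 cross-correlation with zero padding.
    kernel has shape (3, 3, C); the output has the same shape as x."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + H, dx:dx + W, :] * kernel[dy, dx, :]
    return out

def locally_enhanced_block(x, win, wq, wk, wv, kernel):
    """Assumed combination: residual input + windowed attention + local conv."""
    return x + window_self_attention(x, win, wq, wk, wv) + depthwise_conv3x3(x, kernel)
```

Calling `locally_enhanced_block` on an 8x8x4 feature map with `win=4` attends within four 4x4 windows and returns a map of the same shape. Under the same assumptions, a fusion module could apply such a block to features from two modalities after concatenating them along the channel axis, though the abstract does not state how the modalities are combined.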

Diff-IF: Multi-modality image fusion via diffusion model with fusion knowledge prior

FusionDiff: Multi-focus image fusion using denoising diffusion probabilistic models

DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion

DCAFuse: Dual-Branch Diffusion-CNN Complementary Feature Aggregation Network for Multi-Modality Image Fusion

FusionDiff: A unified image fusion network based on diffusion probabilistic models

Dif-Fusion: Towards High Color Fidelity in Infrared and Visible Image Fusion with Diffusion Models

Bridging the Gap between Multi-focus and Multi-modal: A Focused Integration Framework for Multi-modal Image Fusion

MMDRFuse: Distilled Mini-Model with Dynamic Refresh for Multi-Modality Image Fusion

DM-Fusion: Deep Model-Driven Network for Heterogeneous Image Fusion

Equivariant Multi-Modality Image Fusion

DifFUSER: Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation

LeGFusion: Locally Enhanced Global Learning for Multimodal Image Fusion

DCFusion: Difference correlation-driven fusion mechanism of infrared and visible images

CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion

MM-Net: A MixFormer-Based Multi-Scale Network for Anatomical and Functional Image Fusion

MIFFuse: A Multi-Level Feature Fusion Network for Infrared and Visible Images

Conditional Controllable Image Fusion

Simultaneous Tri-Modal Medical Image Fusion and Super-Resolution using Conditional Diffusion Model

Variational Diffusion Method for Remote Sensing Image Fusion