Abstract: Multi-modality image fusion is a technique that combines information from different sensors or modalities, enabling the fused image to retain complementary features from each modality, such as functional highlights and texture details. However, effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently, we introduce a novel training paradigm that encompasses a fusion module, a pseudo-sensing module, and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images, concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA.

What problem does this paper attempt to address?

The paper addresses two key issues in multi-modality image fusion:

1. **Lack of real fused images**: No "super sensor" exists that can perceive the information of all modalities simultaneously, so there are no real fused images to serve as supervision during training. This makes traditional supervised learning methods difficult to apply.
2. **Limitations of existing methods**: Methods based on generative models and manually designed loss functions suffer from poor interpretability, lack of controllability, and training difficulties. Additionally, methods based on manually designed loss functions ignore the potential domain differences between the fused image and the source images, leading to suboptimal fusion results.

To address these issues, the paper proposes a new self-supervised learning framework named EMMA (Equivariant Multi-Modality Image Fusion). The framework builds a pseudo-sensing module that simulates the natural sensing-imaging process, and introduces an equivariant prior that is not specific to any domain to constrain and guide the fused image. EMMA consists of the following main components:

- **Fusion Module**: Adopts a U-Net-like structure combined with Restormer-CNN blocks to generate the fused image.
- **Pseudo-Sensing Module**: Based on U-Net, it maps the fused image back to the source images, simulating the natural sensing-imaging process.
- **Equivariant Fusion Module**: Ensures that the fused image conforms to the equivariant imaging prior.

In this way, EMMA can be trained in a self-supervised manner without real fused images, and the paper reports strong results on infrared-visible and medical image fusion. It also improves the performance of downstream multi-modality object detection and semantic segmentation tasks.
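To make the interplay of the three components concrete, here is a toy sketch in plain Python. It is not the paper's implementation: `fuse`, `pseudo_sense`, `hflip`, `l1`, and `equivariance_loss` are hypothetical stand-ins (a pixel-wise average, a scalar gain, and a horizontal flip replace the learned networks and the transformation group), and the exact loss used by EMMA may differ. The sketch only illustrates the idea that a transformed fused image, once re-sensed and re-fused, should reproduce itself.

```python
# Toy illustration of an equivariance-consistency check for image fusion.
# Every function here is a hypothetical stand-in, NOT one of the EMMA networks.

def fuse(a, b):
    """Stand-in fusion module: pixel-wise average of two images (lists of rows)."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def pseudo_sense(fused, gain):
    """Stand-in pseudo-sensing module: maps a fused image back to one
    modality by applying a scalar gain (assumption for illustration only)."""
    return [[gain * x for x in row] for row in fused]

def hflip(img):
    """Example transformation T: horizontal flip."""
    return [row[::-1] for row in img]

def l1(a, b):
    """Mean absolute difference between two images of equal shape."""
    n = sum(len(row) for row in a)
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def equivariance_loss(ir, vis, transform, gains=(0.8, 1.2)):
    """Fuse, transform, re-sense both modalities, re-fuse, and compare."""
    fused = fuse(ir, vis)
    fused_t = transform(fused)                 # T(f)
    ir_t = pseudo_sense(fused_t, gains[0])     # pseudo infrared image of T(f)
    vis_t = pseudo_sense(fused_t, gains[1])    # pseudo visible image of T(f)
    refused_t = fuse(ir_t, vis_t)              # fusion of the re-sensed pair
    return l1(fused_t, refused_t)              # small when the pipeline is consistent

if __name__ == "__main__":
    ir = [[0.0, 1.0], [2.0, 3.0]]
    vis = [[4.0, 3.0], [2.0, 1.0]]
    print(equivariance_loss(ir, vis, hflip))                    # consistent stand-ins
    print(equivariance_loss(ir, vis, hflip, gains=(0.5, 1.0)))  # inconsistent stand-ins
```

With the default gains the stand-in sensing and fusion steps cancel exactly, so the loss is essentially zero. With mismatched gains the re-fused image drifts away from the transformed one and the loss grows, which is the kind of signal a self-supervised scheme can minimize in the absence of ground-truth fused images.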

Equivariant Multi-Modality Image Fusion

Bridging the Gap between Multi-focus and Multi-modal: A Focused Integration Framework for Multi-modal Image Fusion

Multi-Modal Image Fusion via Self-Supervised Transformer

LeGFusion: Locally-enhanced Global Learning for Multi-Modal Image Fusion

LeGFusion: Locally Enhanced Global Learning for Multimodal Image Fusion

Multimodal Image Fusion Via Self-Supervised Transformer

CMEFusion: Cross-Modal Enhancement and Fusion of FIR and Visible Images

CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion

Diff-IF: Multi-modality image fusion via diffusion model with fusion knowledge prior

MMDRFuse: Distilled Mini-Model with Dynamic Refresh for Multi-Modality Image Fusion

FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba

Deep Equilibrium Multimodal Fusion

MEFusion: Unsupervised Mutual Enhancement for Multimodal Image Fusion

EMEF: Ensemble Multi-Exposure Image Fusion

MMA-UNet: A Multi-Modal Asymmetric UNet Architecture for Infrared and Visible Image Fusion

SIMFusion: A Semantic Information-Guided Modality-Specific Fusion Network for MR Images

MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion

A Task-guided, Implicitly-searched and Meta-initialized Deep Model for Image Fusion

Rethinking Normalization Strategies and Convolutional Kernels for Multimodal Image Fusion

A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

A Multi-Stage Visible and Infrared Image Fusion Network Based on Attention Mechanism