Equivariant Multi-Modality Image Fusion

Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Kai Zhang, Shuang Xu, Dongdong Chen, Radu Timofte, Luc Van Gool
2024-04-16
Abstract: Multi-modality image fusion is a technique that combines information from different sensors or modalities, enabling the fused image to retain complementary features from each modality, such as functional highlights and texture details. However, effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently, we introduce a novel training paradigm that encompasses a fusion module, a pseudo-sensing module, and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images, concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two main obstacles in multi-modality image fusion:

1. **Lack of real fused images**: No "super sensor" exists that can perceive all modalities simultaneously, so there are no ground-truth fused images to supervise training. This makes conventional supervised learning difficult to apply.
2. **Limitations of existing methods**: Methods based on generative models and on manually designed loss functions suffer from poor interpretability, limited controllability, and training difficulties. Methods based on manually designed loss functions also ignore the potential domain gap between the fused image and the source images, which leads to suboptimal fusion results.

To address these issues, the paper proposes a self-supervised learning framework named EMMA (Equivariant Multi-Modality imAge fusion). The framework builds a pseudo-sensing module that simulates the natural sensing-imaging process, and it uses an equivariance prior that is not tied to a specific domain to constrain and guide the fused image. EMMA consists of the following main components:

- **Fusion Module**: Adopts a U-Net-like structure combined with Restormer-CNN blocks to generate the fused image.
- **Pseudo-Sensing Module**: Based on U-Net, it maps the fused image back to the source images, simulating the natural sensing-imaging process.
- **Equivariant Fusion Module**: Ensures that the fused image conforms to the equivariant imaging prior.

In this way, EMMA can be trained in a self-supervised manner without real fused images. The paper reports high-quality results on infrared-visible and medical image fusion, along with improved performance on downstream multi-modality object detection and semantic segmentation.
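
The interplay of the three components can be illustrated with a toy sketch. It is not the authors' implementation, only one plausible reading of the description above: a fused image is mapped back to pseudo source images, and a transformed fused image should survive a round trip of pseudo-sensing followed by re-fusion. All function names are hypothetical, the per-pixel average and identity mappings stand in for the learned networks, a horizontal flip stands in for the transformation, and mean absolute difference stands in for whatever losses the paper actually uses.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # toy grayscale image as nested lists
Op = Callable[[Image], Image]


def hflip(img: Image) -> Image:
    """Horizontal flip, standing in for the transformation T (an assumption)."""
    return [row[::-1] for row in img]


def fuse_mean(a: Image, b: Image) -> Image:
    """Toy fusion module: per-pixel average instead of a learned network."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def identity(img: Image) -> Image:
    """Toy pseudo-sensing module that returns a copy of the fused image."""
    return [row[:] for row in img]


def dim(img: Image) -> Image:
    """Toy pseudo-sensing module that halves every intensity."""
    return [[0.5 * x for x in row] for row in img]


def mean_abs_diff(p: Image, q: Image) -> float:
    """Mean absolute difference between two images of equal size."""
    diffs = [abs(x - y) for rp, rq in zip(p, q) for x, y in zip(rp, rq)]
    return sum(diffs) / len(diffs)


def emma_style_losses(
    fuse: Callable[[Image, Image], Image],
    sense_a: Op,
    sense_b: Op,
    transform: Op,
    a: Image,
    b: Image,
) -> Tuple[float, float]:
    """Return (sensing loss, equivariance loss) for one source pair."""
    f = fuse(a, b)
    # Pseudo-sensing: the fused image should map back to each source.
    sensing_loss = mean_abs_diff(sense_a(f), a) + mean_abs_diff(sense_b(f), b)
    # Equivariance: transform the fused image, sense pseudo sources from it,
    # re-fuse them, and compare with the transformed fused image.
    f_t = transform(f)
    refused = fuse(sense_a(f_t), sense_b(f_t))
    equivariance_loss = mean_abs_diff(refused, f_t)
    return sensing_loss, equivariance_loss


infrared = [[0.8, 0.2], [0.0, 0.4]]
visible = [[0.4, 0.6], [0.2, 0.0]]

consistent = emma_style_losses(fuse_mean, identity, identity, hflip, infrared, visible)
inconsistent = emma_style_losses(fuse_mean, identity, dim, hflip, infrared, visible)
print("consistent:", consistent)
print("inconsistent:", inconsistent)
```

With the identity stand-ins the round trip reproduces the flipped fused image, so the equivariance term vanishes. Swapping in the dimming stand-in breaks that consistency and the term becomes positive, which is the kind of signal a self-supervised objective of this shape could use in place of ground-truth fused images.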