Abstract: Current CNN-based methods for infrared and visible image fusion are limited by the low discrimination of extracted structural features, the adoption of uniform loss functions, and the lack of inter-modal feature interaction, which make it difficult to obtain optimal fusion results. To alleviate these problems, a framework for multimodal feature learning fusion using a cross-attention Transformer is proposed. To extract rich structural features at different scales, residual U-Nets with mixed receptive fields are adopted to capture salient object information at various granularities. Then, a hybrid attention fusion strategy is employed to integrate the complementary information from the input images. Finally, adaptive loss functions are designed to achieve optimal fusion results for different modal features. The proposed fusion framework, HATF, is evaluated on the TNO, FLIR, and LLVIP datasets, which cover diverse scenes and varying illumination conditions. In the comparative experiments, HATF achieved competitive results on all three datasets, with the EN, SD, MI, and SSIM metrics reaching the best performance on the TNO dataset and surpassing the second-best method by 2.3%, 18.8%, 4.2%, and 2.2%, respectively. These results validate the effectiveness of the proposed method in terms of both robustness and image fusion quality compared to several popular methods.
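
The abstract reports EN and SD without defining them. As a rough illustration, the sketch below assumes the definitions commonly used in image fusion work: EN as the Shannon entropy of the grey-level histogram and SD as the standard deviation of pixel intensities. It also assumes the reported percentages are relative gains over the second-best score. The function names are hypothetical, and this is not the authors' evaluation code.

```python
import math
from collections import Counter


def entropy(pixels):
    """EN (assumed definition): Shannon entropy, in bits, of the grey-level histogram."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def standard_deviation(pixels):
    """SD (assumed definition): population standard deviation of pixel intensities."""
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def relative_gain(best, second_best):
    """Percentage by which `best` exceeds `second_best` (assumed meaning of 'surpassing by x%')."""
    return 100.0 * (best - second_best) / second_best
```

Both metric functions take a flat sequence of grey levels, so a 2-D image would need to be flattened first. Higher EN and SD are usually read as more information and more contrast in the fused image.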

What problem does this paper attempt to address?

The paper addresses limitations of current convolutional neural network (CNN) methods for infrared and visible image fusion: low discrimination of extracted structural features, the use of a uniform loss function, and the lack of cross-modal feature interaction. These limitations make it difficult to obtain optimal fusion results. To address them, the authors propose HATF, a multi-modal feature learning framework based on a hybrid attention Transformer, to achieve efficient fusion of infrared and visible images. The main contributions of the paper are as follows:

1. **Residual U-Net Block (RUB)**: A RUB is used in each encoding block. Through its nested U-shaped structure, the network captures rich local and global information at different scales simultaneously.
2. **Hybrid Attention Mechanism**: The mechanism combines intra-domain self-attention with cross-domain cross-attention. Self-attention first extracts global information from each single-modal image; cross-attention then captures the interaction between the two modalities. This integrates the complementary information of infrared and visible features and yields more informative fused images.
3. **Adaptive Fusion Loss Function**: The adaptive fusion loss combines terms for different modal features, including a structural similarity loss, a multi-modal feature loss, and a saliency loss, to achieve high-quality image fusion.

Through these contributions, the paper aims to improve the quality and robustness of infrared and visible image fusion, particularly under varying lighting conditions and in complex scenes.
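
The self-attention followed by cross-attention flow described above can be sketched with NumPy. This is a minimal single-head illustration under stated assumptions, not the HATF implementation: it omits learned query/key/value projections, treats each image as an already tokenized feature matrix with the same number of tokens, and fuses the two cross-attended streams by simple averaging. The names `attention` and `hybrid_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention; returns the output and the attention weights."""
    scale = np.sqrt(query.shape[-1])
    weights = softmax(query @ key.T / scale)
    return weights @ value, weights


def hybrid_attention(ir_tokens, vis_tokens):
    """Intra-domain self-attention, then cross-domain cross-attention.

    ir_tokens, vis_tokens: arrays of shape (num_tokens, dim).
    The final averaging step is an assumption made for illustration only.
    """
    # Step 1: each modality attends to itself to gather global context.
    ir_self, _ = attention(ir_tokens, ir_tokens, ir_tokens)
    vis_self, _ = attention(vis_tokens, vis_tokens, vis_tokens)
    # Step 2: queries from one modality attend to keys/values of the other.
    ir_cross, _ = attention(ir_self, vis_self, vis_self)
    vis_cross, _ = attention(vis_self, ir_self, ir_self)
    # Step 3 (assumed): merge the two interaction streams.
    return 0.5 * (ir_cross + vis_cross)


rng = np.random.default_rng(0)
ir = rng.normal(size=(6, 8))   # 6 tokens, 8 channels (toy sizes)
vis = rng.normal(size=(6, 8))
fused = hybrid_attention(ir, vis)
```

In the cross-attention step, the query comes from one modality while the keys and values come from the other, which is what lets infrared features pull in visible-image context and vice versa.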

HATF: Multi-Modal Feature Learning for Infrared and Visible Image Fusion via Hybrid Attention Transformer
