HATF: Multi-Modal Feature Learning for Infrared and Visible Image Fusion via Hybrid Attention Transformer

Xiangzeng Liu, Ziyao Wang, Haojie Gao, Xiang Li, Lei Wang, Qiguang Miao
DOI: https://doi.org/10.3390/rs16050803
IF: 5
2024-02-26
Remote Sensing
Abstract: Current CNN-based methods for infrared and visible image fusion are limited by the low discrimination of extracted structural features, the adoption of uniform loss functions, and the lack of inter-modal feature interaction, which make it difficult to obtain optimal fusion results. To alleviate the above problems, a framework for multimodal feature learning fusion using a cross-attention Transformer is proposed. To extract rich structural features at different scales, residual U-Nets with mixed receptive fields are adopted to capture salient object information at various granularities. Then, a hybrid attention fusion strategy is employed to integrate the complementing information from the input images. Finally, adaptive loss functions are designed to achieve optimal fusion results for different modal features. The fusion framework proposed in this study is thoroughly evaluated using the TNO, FLIR, and LLVIP datasets, encompassing diverse scenes and varying illumination conditions. In the comparative experiments, HATF achieved competitive results on three datasets, with EN, SD, MI, and SSIM metrics reaching the best performance on the TNO dataset, surpassing the second-best method by 2.3%, 18.8%, 4.2%, and 2.2%, respectively. These results validate the effectiveness of the proposed method in terms of both robustness and image fusion quality compared to several popular methods.
environmental sciences; imaging science & photographic technology; remote sensing; geosciences, multidisciplinary
What problem does this paper attempt to address?
The paper addresses limitations of current CNN-based methods for infrared and visible-light image fusion: the extracted structural features have low discrimination, a uniform loss function is applied to all modalities, and there is no cross-modal feature interaction. Together these make it difficult to obtain optimal fusion results. To address them, the authors propose HATF, a multi-modal feature learning framework based on a hybrid-attention Transformer, for fusing infrared and visible-light images.

The main contributions of the paper are as follows:

1. **Residual U-Net Block (RUB)**: Each encoding block uses an RUB. Through its nested U-shaped structure, the network captures rich local and global information at different scales simultaneously.
2. **Hybrid Attention Mechanism**: The fusion stage combines intra-domain self-attention with cross-domain cross-attention. Self-attention first extracts global information from each single-modal image; cross-attention then captures the interaction between the two modalities, integrating the complementary information in the infrared and visible-light features to produce a more informative fused image.
3. **Adaptive Fusion Loss Function**: The loss combines terms for different modal features, namely a structural similarity loss, a multi-modal feature loss, and a saliency loss, to achieve high-quality image fusion.

Through these contributions, the paper aims to improve the quality and robustness of infrared and visible-light image fusion, particularly under varying lighting conditions and in complex scenes.
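The ordering described for the hybrid attention mechanism (self-attention within each modality, then cross-attention between modalities) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the residual connections, the random projection weights, the final concatenation, and all names (`attention`, `hybrid_attention_fuse`, `init_params`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_src, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    Queries come from `query_src`; keys and values come from `context`.
    Passing the same array for both gives self-attention; passing the
    other modality as `context` gives cross-attention.
    """
    q = query_src @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(dim, rng):
    """Random (w_q, w_k, w_v) triples; hypothetical stand-ins for learned weights."""
    def make():
        return tuple(rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    return {name: make() for name in ("ir_self", "vis_self", "ir_cross", "vis_cross")}


def hybrid_attention_fuse(ir_tokens, vis_tokens, params):
    """Intra-domain self-attention followed by cross-domain cross-attention.

    Inputs are (num_tokens, dim) feature arrays for each modality.
    Residual connections and concatenation are assumptions of this sketch.
    """
    # Step 1: each modality attends to itself (global context per modality).
    ir = ir_tokens + attention(ir_tokens, ir_tokens, *params["ir_self"])
    vis = vis_tokens + attention(vis_tokens, vis_tokens, *params["vis_self"])

    # Step 2: each modality queries the other (inter-modal interaction).
    ir_x = ir + attention(ir, vis, *params["ir_cross"])
    vis_x = vis + attention(vis, ir, *params["vis_cross"])

    # Step 3: merge both streams along the feature axis.
    return np.concatenate([ir_x, vis_x], axis=-1)


rng = np.random.default_rng(0)
ir_feats = rng.standard_normal((16, 8))   # 16 tokens, 8 features (toy sizes)
vis_feats = rng.standard_normal((16, 8))
fused = hybrid_attention_fuse(ir_feats, vis_feats, init_params(8, rng))
print(fused.shape)
```

The only difference between the two attention steps is where the keys and values come from, which is why one `attention` function serves both. A practical model would likely add multi-head projections, normalization, and feed-forward layers, none of which are shown here.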