Abstract: Deep learning has been successfully applied to infrared and visible image fusion owing to its powerful feature representation ability. Most existing deep learning-based infrared and visible image fusion methods rely on either a pure convolutional model or a pure transformer model, so the fused image cannot preserve long-range dependencies (global context) and local features simultaneously. To this end, we propose a convolution-guided transformer framework for infrared and visible image fusion (CGTF), which combines the local features of a convolutional network with the long-range dependency features of a transformer to produce a satisfactory fused image. In CGTF, local features are computed by a convolution feature extraction module (CFEM) and are then used to guide the transformer feature extraction module (TFEM) in capturing the long-range dependencies of the image. This overcomes both the lack of long-range dependencies in convolutional fusion methods and the deficiency of local features in transformer models. Moreover, because the CFEM and the transformer module are used alternately, the framework can account for the inherent relationship between local features and long-range dependencies. In addition, to strengthen local feature propagation, we employ dense connections among the CFEMs. Ablation experiments demonstrate the effectiveness of the convolution-guided transformer fusion framework and of the loss function. We compare our method with nine other methods on two datasets: three traditional methods, five deep learning-based methods, and one transformer-based method. Qualitative and quantitative experiments demonstrate the advantages of our method.
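
The data flow described in the abstract can be illustrated with a toy 1-D sketch. This is not the authors' implementation: the function names (`conv1d_same`, `guided_attention`, `cgtf_like_features`), the scalar features, the use of summation in place of channel concatenation for the dense connections, and the way the previous attention output is added to the guide are all assumptions made for illustration. It shows only the alternation of a convolutional stage (local features) and a self-attention stage (long-range dependencies) that takes the convolutional features as input.

```python
import math

def conv1d_same(x, kernel):
    """Local features: 1-D sliding-window filter (cross-correlation) with zero padding."""
    r = len(kernel) // 2
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < n:
                acc += w * x[j]
        out.append(acc)
    return out

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def guided_attention(guide):
    """Global mixing: scalar self-attention whose queries, keys, and values
    are all taken from the convolutional (guide) features."""
    out = []
    for q in guide:
        weights = softmax([q * k for k in guide])
        out.append(sum(w * v for w, v in zip(weights, guide)))
    return out

def cgtf_like_features(signal, kernels):
    """Alternate a conv stage and an attention stage.

    Dense connections (assumption: modeled as a sum, not a concatenation):
    each conv stage receives the input plus all earlier conv outputs.
    """
    conv_outputs = []
    attn = None
    for kernel in kernels:
        dense_in = [sum(vals) for vals in zip(signal, *conv_outputs)]
        local = conv1d_same(dense_in, kernel)
        conv_outputs.append(local)
        # The local features guide the attention stage; from the second
        # stage on, the previous attention output is added in (assumption).
        guide = local if attn is None else [l + a for l, a in zip(local, attn)]
        attn = guided_attention(guide)
    return conv_outputs, attn

if __name__ == "__main__":
    signal = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0]
    smoothing = [0.25, 0.5, 0.25]
    conv_outputs, attn = cgtf_like_features(signal, [smoothing, smoothing])
    print(len(conv_outputs), len(attn))
```

Each attention output is a convex combination of the guide values, so every position can draw on every other position, while the convolutional stage only sees a three-sample neighborhood.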

MFT: Multi-scale Fusion Transformer for Infrared and Visible Image Fusion

Infrared and Visible Image Fusion Based on Multiscale Adaptive Transformer

MFST: Multi-Modal Feature Self-Adaptive Transformer for Infrared and Visible Image Fusion

THFuse: An Infrared and Visible Image Fusion Network using Transformer and Hybrid Feature Extractor

MFTCFNet: Infrared and Visible Image Fusion Network Based on Multi-Layer Feature Tightly Coupled

MGT: Modality-Guided Transformer for Infrared and Visible Image Fusion

HDCTfusion: Hybrid Dual-Branch Network Based on CNN and Transformer for Infrared and Visible Image Fusion

TCCFusion: An Infrared and Visible Image Fusion Method based on Transformer and Cross Correlation

MCFusion: infrared and visible image fusion based multiscale receptive field and cross-modal enhanced attention mechanism

SFPFusion: An Improved Vision Transformer Combining Super Feature Attention and Wavelet-Guided Pooling for Infrared and Visible Images Fusion

A Dual Cross Attention Transformer Network for Infrared and Visible Image Fusion

GTMFuse: Group-Attention Transformer-Driven Multiscale Dense Feature-Enhanced Network for Infrared and Visible Image Fusion

HATF: Multi-Modal Feature Learning for Infrared and Visible Image Fusion via Hybrid Attention Transformer

AFT: Adaptive Fusion Transformer for Visible and Infrared Images

A Fusion Framework for Infrared and Visible Images Based on CNN and MST

HitFusion: Infrared and Visible Image Fusion for High-Level Vision Tasks Using Transformer

MAFusion: Multiscale Attention Network for Infrared and Visible Image Fusion

DATFuse: Infrared and Visible Image Fusion via Dual Attention Transformer

CGTF: Convolution-Guided Transformer for Infrared and Visible Image Fusion

MGFCTFuse: A Novel Fusion Approach for Infrared and Visible Images

Image Fusion Using a Multi-Level Image Decomposition and Fusion Method