SDTFusion: A split-head dense transformer based network for infrared and visible image fusion

Shan Pang, Hongtao Huo, Xiaowen Liu, Bowen Zheng, Jing Li
DOI: https://doi.org/10.1016/j.infrared.2024.105209
IF: 2.997
2024-02-05
Infrared Physics & Technology
Abstract: Most current deep-learning-based image fusion methods rely heavily on convolutional operations for feature extraction. Recently, some Transformer-based image fusion models have emerged. However, most of them design complex attention mechanisms and still rely heavily on convolutions for local feature modeling. To address these limitations, this paper proposes a novel and simple split-head dense Transformer-based infrared and visible image fusion network, termed SDTFusion. It consists of three parts: the feature extraction module, the inter-gating fusion module, and the reconstruction module. In particular, the feature extraction module is a pure Transformer network in which an interactive split-head attention mechanism models uni-modal and cross-modal long-range dependencies and promotes cross-modal information extraction. Dense connections between Transformer blocks facilitate the reuse of feature maps. In the fusion module, the inter-gating mechanism is formulated as the element-wise product of cross-modal information, which effectively retains competitive infrared brightness and distinct visible details. Moreover, a learnable detail injection module built on a cross-attention mechanism injects fine-grained bi-modal information into multiple layers of the reconstruction module. Extensive experiments on three benchmark datasets show that SDTFusion achieves strong fusion performance compared with nine state-of-the-art methods. In addition, its superior performance on semantic segmentation and object detection demonstrates the advantage of the framework in promoting downstream visual tasks.
optics; physics, applied; instruments & instrumentation
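
The abstract describes the inter-gating mechanism only as an element-wise product of cross-modal information and gives no formula. The following is a minimal illustrative sketch of one way such a gate could look, not the authors' implementation. The sigmoid gating, the additive combination of the two gated branches, and the names `sigmoid` and `inter_gate_fuse` are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, used here as a hypothetical gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def inter_gate_fuse(ir, vis):
    """Fuse two equally shaped 2-D feature maps (lists of lists of floats).

    Assumed form (not taken from the paper): each modality is multiplied
    element-wise by a sigmoid gate computed from the other modality, and
    the two gated maps are summed.
    """
    if len(ir) != len(vis) or any(len(a) != len(b) for a, b in zip(ir, vis)):
        raise ValueError("feature maps must have the same shape")
    fused = []
    for ir_row, vis_row in zip(ir, vis):
        fused.append([
            a * sigmoid(b) + b * sigmoid(a)  # cross-modal element-wise products
            for a, b in zip(ir_row, vis_row)
        ])
    return fused
```

For example, `inter_gate_fuse([[0.0, 2.0]], [[0.0, 0.0]])` passes each infrared value through a gate derived from the corresponding visible value, and vice versa. In a real network the gates would be computed from learned projections of the features rather than from the raw values.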