Abstract: The fusion of far infrared (FIR) and visible images aims to generate a high-quality composite image that contains salient structures and abundant texture details for human visual perception. However, existing fusion methods typically fail to exploit the complementary characteristics of the source images to boost the features extracted from degraded visible or FIR inputs, and thus cannot generate satisfactory fusion results in adverse lighting or weather conditions. In this paper, we propose a novel Cross-Modal multispectral image Enhancement and Fusion framework (CMEFusion), which adaptively enhances both FIR and visible inputs by leveraging complementary cross-modal features to further facilitate multispectral feature aggregation. Specifically, we first present a new cross-modal image enhancement sub-network (CMIENet), built on a CNN-Transformer hybrid architecture, which performs a complementary exchange of the local-salient and global-contextual features extracted from the FIR and visible modalities, respectively. Then, we design a gradient-content differential fusion sub-network (GCDFNet) to progressively integrate decoupled gradient and content information via a modified central difference convolution. Finally, we present a joint enhancement-fusion multi-term loss function that narrows the optimization gap between the two sub-networks, based on self-supervised constraints on exposure, color, structure, and intensity. In this manner, the proposed CMEFusion model performs visible and FIR image fusion in an end-to-end manner and produces fused images with a more natural and realistic appearance. Extensive experiments validate that CMEFusion surpasses state-of-the-art image fusion algorithms in both visual quality and quantitative evaluations.
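For readers unfamiliar with central difference convolution, the following is a minimal sketch of the standard (unmodified) operator, in which each output is a vanilla convolution response minus `theta` times the center pixel times the kernel sum. It is not the modified variant used in GCDFNet, whose details the abstract does not give. The function name `central_difference_conv2d`, the parameter `theta`, and the pure-Python list representation are assumptions chosen for illustration only.

```python
def central_difference_conv2d(image, kernel, theta=0.7):
    """Valid-mode 2D central difference convolution on nested lists.

    Standard formulation (an assumption here, not the paper's modified form):
        y(p0) = sum_n w(pn) * x(p0 + pn)  -  theta * x(p0) * sum_n w(pn)
    theta = 0 gives an ordinary convolution (content/intensity cue);
    theta = 1 gives a purely difference-based (gradient) response.
    """
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2  # offset of the kernel center
    kernel_sum = sum(sum(row) for row in kernel)
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1

    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Vanilla (cross-correlation) response over the local window.
            vanilla = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(kh)
                for b in range(kw)
            )
            # Central difference term: center pixel weighted by the kernel sum.
            center = image[i + ch][j + cw]
            row.append(vanilla - theta * center * kernel_sum)
        output.append(row)
    return output


# Hypothetical toy input: a 3x3 intensity ramp and a kernel that picks
# the right-hand neighbor of the center pixel.
ramp = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
right_neighbor = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
gradient_response = central_difference_conv2d(ramp, right_neighbor, theta=1)
content_response = central_difference_conv2d(ramp, right_neighbor, theta=0)
```

With `theta=1` the toy kernel returns the horizontal difference between the right neighbor and the center pixel, while `theta=0` returns the neighbor's intensity itself. This illustrates how a single operator can interpolate between gradient-like and content-like responses.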

Cross-modal Fusion for Multi-Label Image Classification with Attention Mechanism

Multi-Scale Cross-Attention Fusion Network Based on Image Super-Resolution

Label-Guided Cross-Modal Attention Network for Multi-Label Aerial Image Classification

A multi-label image classification method combining multi-stage image semantic information and label relevance

MCFusion: infrared and visible image fusion based multiscale receptive field and cross-modal enhanced attention mechanism

CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach

Learning to Discover Multi-Class Attentional Regions for Multi-Label Image Recognition

CMFA_Net: A cross-modal feature aggregation network for infrared-visible image fusion

A Cross-scale Iterative Attentional Adversarial Fusion Network for Infrared and Visible Images

MuMIC -- Multimodal Embedding for Multi-label Image Classification with Tempered Sigmoid

Two-Stream Video Classification with Cross-Modality Attention

Cross Attention-Based Multi-Scale Convolutional Fusion Network for Hyperspectral and LiDAR Joint Classification

G-CAM: Graph Convolution Network Based Class Activation Mapping for Multi-label Image Recognition

Double Attention Based on Graph Attention Network for Image Multi-Label Classification

ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection

CMEFusion: Cross-Modal Enhancement and Fusion of FIR and Visible Images

CAT: Learning to Collaborate Channel and Spatial Attention from Multi-Information Fusion

TMCFN: Text-Supervised Multidimensional Contrastive Fusion Network for Hyperspectral and LiDAR Classification

Cross-Modal Image Fusion Theory Guided by Subjective Visual Attention

Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery