Abstract: With the rapid progress of deep learning, multi-modality image fusion has become increasingly common in object detection. Fusion remains challenging, however, because different sources depict the same scene content in inherently different ways. Current fusion methods identify characteristics shared by the two modalities and integrate them within this shared domain using either iterative optimization or deep learning architectures. These approaches often neglect the intricate semantic relationships between modalities, which leads to a superficial understanding of inter-modal connections and, consequently, suboptimal fusion results. To address this, we introduce a text-guided multi-modality image fusion method that leverages the high-level semantics of textual descriptions to integrate semantics from infrared and visible images. The method exploits the complementary characteristics of the modalities to improve both the accuracy and the robustness of object detection. A codebook provides a compact representation of the fused intra- and inter-domain dynamics and is fine-tuned for detection performance. We present a bilevel optimization strategy that links the fusion and detection problems and optimizes both concurrently. Furthermore, we introduce the first dataset of paired infrared and visible images accompanied by text prompts, paving the way for future research. Extensive experiments on several datasets demonstrate that our method not only produces visually superior fusion results but also achieves higher detection mAP than existing methods, reaching state-of-the-art performance.

From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion
