Abstract: Infrared and visible image fusion has evolved from methods oriented toward visual perception to strategies that consider both visual perception and high-level vision tasks. However, existing task-driven methods fail to address the domain gap between semantic and geometric representations. To overcome this issue, we propose a high-level vision task-driven infrared and visible image fusion network via semantic and geometric domain transformation, termed HSFusion. Specifically, to minimize the gap between semantic and geometric representations, we design two separate domain transformation branches based on the CycleGAN framework, each of which includes two processes: the forward segmentation process and the reverse reconstruction process. CycleGAN is capable of learning domain transformation patterns, and its reconstruction process is conducted under the constraint of these patterns. Thus, our method can significantly facilitate the integration of semantic and geometric information and further reduce the domain gap. In the fusion stage, we integrate the infrared and visible features extracted from the reconstruction processes of the two separate CycleGANs to obtain the fused result. These features, containing varying proportions of semantic and geometric information, can significantly enhance high-level vision tasks. Additionally, we generate masks based on the segmentation results to guide the fusion task. These masks provide semantic priors, and we design adaptive weights for the two distinct areas in the masks to facilitate image fusion. Finally, we conducted comparative experiments between our method and eleven state-of-the-art methods, demonstrating that our approach surpasses the others in both visual appeal and the semantic segmentation task.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper "HSFusion: High-Level Vision Task-Driven Infrared and Visible Image Fusion Network Based on Semantic and Geometric Domain Transformation" aims to address the following issues:

1. **Domain Gap Between Semantic and Geometric Representations**:
   - Existing task-driven methods fail to effectively address the domain gap between semantic and geometric representations when fusing infrared and visible images. This leads to poor performance of the fusion results in high-level vision tasks (e.g., semantic segmentation).
2. **Joint Optimization of Fusion and High-Level Vision Tasks**:
   - Traditional image fusion methods usually focus only on visual perception, neglecting the needs of high-level vision tasks. To improve the effectiveness of fusion results in high-level vision tasks, a method that can simultaneously optimize fusion and high-level vision tasks is needed.
3. **Complementarity of Different Modal Information**:
   - Visible light sensors can clearly capture the texture details of objects but are easily affected by extreme conditions (e.g., darkness, strong light, or rain and fog). Infrared sensors capture information through thermal radiation, excelling at capturing object contours and being robust to environmental changes, but lack detailed texture information. Therefore, a method is needed to integrate information from these two modalities to meet the needs of visual perception and high-level vision tasks.

### Solution

To address the above issues, the authors propose a high-level vision task-driven infrared and visible image fusion network based on semantic and geometric domain transformation (HSFusion). Specifically:

1. **Dual Independent Pre-trained Feature Extractors**:
   - Two separate CycleGAN frameworks are used as feature extractors to process infrared and visible images, respectively. Each CycleGAN framework includes a forward segmentation process and a backward reconstruction process to learn stable domain transformation patterns.
2. **Adaptive Feature Fusion Network**:
   - During the fusion stage, masks are generated based on segmentation results, and an adaptive weighting strategy is designed to focus more on infrared features in thermal source areas and visible features in non-thermal source areas during the fusion process.
3. **Semantic Segmentation-Guided Fusion**:
   - The masks generated from semantic segmentation results guide the fusion process, enhancing the complementary semantic priors of different source images, thereby improving the performance of fusion and high-level vision tasks.

### Main Contributions

1. **Comprehensive Extraction of Semantic and Geometric Information**:
   - By using two independent pre-trained feature extractors, the semantic and geometric information of infrared and visible images is fully extracted, not only improving visual perception but also enhancing the semantic representation of the fusion results.
2. **Minimizing the Domain Gap Between Semantic and Geometric Information**:
   - The CycleGAN structure is used to learn the latent transformation patterns of different domains, integrating semantic and geometric information under these constraints.
3. **Guiding the Fusion Process with Semantic Masks**:
   - The generated semantic masks enhance the complementary semantic priors of different source images, further improving the performance of fusion and high-level vision tasks.
4. **Experimental Validation**:
   - Experimental results show that HSFusion achieves state-of-the-art performance in both visual perception and high-level semantic segmentation tasks.

Through these methods, HSFusion effectively addresses the issues present in existing methods when fusing infrared and visible images, enhancing the application value of fusion results in high-level vision tasks.
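
### Illustrative Sketch: Mask-Guided Weighting

The adaptive weighting idea described under "Adaptive Feature Fusion Network" can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_guided_fusion`, the fixed weights `w_hot` and `w_cold`, and the use of plain nested lists are all assumptions made for illustration. In HSFusion the weights are adaptive and the inputs are feature maps from the CycleGAN reconstruction branches, and a real implementation would likely operate on tensors. The sketch only shows how a binary mask can switch the blend toward infrared inside thermal-source regions and toward visible elsewhere.

```python
def mask_guided_fusion(ir, vis, mask, w_hot=0.8, w_cold=0.2):
    """Blend two equally sized 2-D maps using a binary mask.

    ir, vis : nested lists of floats (infrared / visible values).
    mask    : nested list of 0/1; 1 marks a thermal-source pixel.
    w_hot   : hypothetical infrared weight inside the mask.
    w_cold  : hypothetical infrared weight outside the mask.
    """
    fused = []
    for ir_row, vis_row, mask_row in zip(ir, vis, mask):
        row = []
        for ir_px, vis_px, m in zip(ir_row, vis_row, mask_row):
            # Favor infrared in thermal-source areas, visible elsewhere.
            w_ir = w_hot if m else w_cold
            row.append(w_ir * ir_px + (1.0 - w_ir) * vis_px)
        fused.append(row)
    return fused


# Toy example: infrared is all ones, visible is all zeros,
# so each fused value equals the infrared weight used at that pixel.
ir = [[1.0, 1.0], [1.0, 1.0]]
vis = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 1]]
fused = mask_guided_fusion(ir, vis, mask)
```

With these toy inputs, masked pixels take the infrared-heavy weight and unmasked pixels take the visible-heavy one. Replacing the two constants with learned, per-region weights would be closer in spirit to the adaptive strategy the summary describes.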

HSFusion: A high-level vision task-driven infrared and visible image fusion network via semantic and geometric domain transformation

Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network

SIGFusion: Semantic Information-Guided Infrared and Visible Image Fusion

Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity

Fusion of Infrared and Visible Images Via Multi-Layer Convolutional Sparse Representation

Infrared and Visible Image Fusion Based on a Two-Stage Class Conditioned Auto-Encoder Network.

A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

SCFusion: Infrared and Visible Fusion Based on Salient Compensation

SFPFusion: An Improved Vision Transformer Combining Super Feature Attention and Wavelet-Guided Pooling for Infrared and Visible Images Fusion

Semantic-Aware Fusion Network Based on Super-Resolution

SADFusion: A multi-scale infrared and visible image fusion method based on salient-aware and domain-specific

Distillation-fusion-semantic unified driven network for infrared and visible image fusion

A Multi-Stage Visible and Infrared Image Fusion Network Based on Attention Mechanism

SFCFusion: Spatial–Frequency Collaborative Infrared and Visible Image Fusion

HitFusion: Infrared and Visible Image Fusion for High-Level Vision Tasks Using Transformer

CHFusion: A Cross-modality High-resolution Representation Framework for Infrared and Visible Image Fusion

DCFusion: A Dual-Frequency Cross-Enhanced Fusion Network for Infrared and Visible Image Fusion.

IVGF: The Fusion-Guided Infrared and Visible General Framework

SFDFusion: An Efficient Spatial-Frequency Domain Fusion Network for Infrared and Visible Image Fusion

Infrared and Visible Image Fusion with Hierarchical Human Perception

SeGFusion: A semantic saliency guided infrared and visible image fusion method