Abstract: Image fusion is well known as an alternative to single-image restoration: rather than recovering a high-quality image from one degraded image, it generates one high-quality image from multiple images. The essence of image fusion is to integrate complementary information from the source images. Existing fusion methods generalize poorly across tasks and often require labor-intensive designs, because the diverse requirements of each fusion task make it difficult to identify and extract useful information from the source images. Additionally, these methods develop highly specialized features for different downstream applications, which hinders adaptation to new and diverse downstream tasks. To address these limitations, we introduce DeFusion++, a novel framework that leverages self-supervised learning (SSL) to enhance the versatility of feature representations for different image fusion tasks. DeFusion++ learns fusion-friendly representations from large-scale data in a self-supervised way, overcoming the constraints of limited fusion datasets. Specifically, we introduce two innovative pretext tasks: common and unique decomposition (CUD) and masked feature modeling (MFM). CUD decomposes source images into abstract common and unique components, while MFM refines these components into robust fused features. Joint training on these tasks enables DeFusion++ to produce adaptable representations that can effectively extract useful information from various source images, regardless of the fusion task. The resulting fused representations are also highly adaptable to a wide range of downstream tasks, including image segmentation and object detection. DeFusion++ stands out by producing versatile fused representations that enhance both the quality of image fusion and the effectiveness of downstream high-level vision tasks, while simplifying the process with an elegant fusion framework.

Combine Early and Late Fusion Together: A Hybrid Fusion Framework for Image-Text Matching

Fusion Layer Attention for Image-Text Matching

Enhancing Separate Encoding with Multi-layer Feature Alignment for Image-Text Matching

TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion

From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

CMEFusion: Cross-Modal Enhancement and Fusion of FIR and Visible Images

Bridging the gap: dual perception attention and local-global similarity fusion for cross-modal image-text matching

Shape-Former: Bridging CNN and Transformer via ShapeConv for multimodal image matching

Feature Fusion Based on Transformer for Cross-modal Retrieval

An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation

MEFusion: Unsupervised Mutual Enhancement for Multimodal Image Fusion

CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach

A Task-guided, Implicitly-searched and Meta-initialized Deep Model for Image Fusion

HATF: Multi-Modal Feature Learning for Infrared and Visible Image Fusion via Hybrid Attention Transformer

iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition

Query Adaptive Late Fusion for Image Retrieval

Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond