Abstract: Interest in remote sensing image–text cross-modal retrieval has grown in recent years, driven by the rapid development of space information technology and a significant increase in the volume of remote sensing image data. Remote sensing images have two characteristics that make cross-modal retrieval challenging. First, their semantics are fine-grained: an image can be divided into multiple basic units of semantic expression, and different combinations of these units can generate diverse text descriptions. Second, the images vary in resolution, color, and perspective. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding method, which has proved effective in cross-modal retrieval of natural images. The model is trained jointly on three tasks: image–text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC). This joint training enhances its capability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to make the model's joint representations and fine-grained correlations more consistent across remote sensing images that differ significantly in resolution, color, and angle. Furthermore, to reduce the computational cost of large-scale fusion models and improve retrieval efficiency, this paper proposes a retrieval filtering method that achieves higher retrieval efficiency while minimizing accuracy loss. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
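The abstract does not spell out how the retrieval filtering method works. As a rough illustration of the general idea only, the sketch below assumes a common two-stage design: a cheap similarity over precomputed embeddings first narrows the gallery to k candidates, and the expensive fusion model then scores only those candidates. The function names, the cosine pre-filter, and the toy scores are all assumptions made for this example and are not taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_retrieval(
    query_vec: Vector,
    gallery_vecs: List[Vector],
    fusion_score: Callable[[int], float],
    k: int,
) -> List[Tuple[int, float]]:
    """Two-stage retrieval: cheap filter, then expensive re-ranking.

    Stage 1 ranks every gallery item by cosine similarity of precomputed
    embeddings and keeps the top k. Stage 2 calls the (hypothetical)
    fusion_score only on those k candidates and sorts by that score.
    """
    coarse = sorted(
        range(len(gallery_vecs)),
        key=lambda i: cosine(query_vec, gallery_vecs[i]),
        reverse=True,
    )[:k]
    reranked = [(i, fusion_score(i)) for i in coarse]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked


# Toy data (assumed): four gallery embeddings and a stand-in fusion scorer.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
calls: List[int] = []


def toy_fusion_score(index: int) -> float:
    """Placeholder for the fusion encoder; records which items it scored."""
    calls.append(index)
    return [0.2, 0.9, 0.99, 0.5][index]


results = filtered_retrieval(query, gallery, toy_fusion_score, k=2)
print(results)  # [(1, 0.9), (0, 0.2)]
```

In this toy run the fusion scorer is called for only two of the four gallery items, which is where the efficiency gain comes from. Item 2 would have received the highest fusion score but never reaches the second stage, which illustrates the kind of accuracy loss a filtering step has to keep small.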

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing
