Abstract: In recent years, interest in remote sensing image–text cross-modal retrieval has grown with the rapid development of space information technology and the significant increase in the volume of remote sensing image data. Remote sensing images have unique characteristics that make the cross-modal retrieval task challenging. First, the semantics of remote sensing images are fine-grained: an image can be divided into multiple basic units of semantic expression, and different combinations of these units can generate diverse text descriptions. Second, these images vary in resolution, color, and perspective. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding method, which has proven effective in the cross-modal retrieval of natural images. By jointly training the model on three tasks, namely image–text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC), we enhance its capability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to make the model's joint representations and fine-grained correlations more consistent, particularly for remote sensing images that differ significantly in resolution, color, and viewing angle. Furthermore, to address the computational complexity of large-scale fusion models, this paper proposes a retrieval filtering method that improves retrieval efficiency while minimizing accuracy loss. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
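The abstract does not describe how the retrieval filtering method works. As a rough illustration of the general idea only, the sketch below assumes a common two-stage pattern: a cheap similarity over precomputed embeddings shortlists the top `k` candidates, and an expensive fusion-style scorer is run only on that shortlist. The names `filter_then_rerank`, `fusion_score`, and `k` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def filter_then_rerank(
    query_vec: Sequence[float],
    candidate_vecs: Sequence[Sequence[float]],
    fusion_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Return candidate indices, best first, after a two-stage search.

    Assumption (not from the paper): stage 1 uses a dot product over
    precomputed embeddings; stage 2 calls a costly scorer, standing in
    for a fusion encoder, only on the k shortlisted candidates.
    """
    # Stage 1: cheap filter over every candidate.
    shortlist = sorted(
        range(len(candidate_vecs)),
        key=lambda i: dot(query_vec, candidate_vecs[i]),
        reverse=True,
    )[:k]
    # Stage 2: expensive scoring restricted to the shortlist.
    return sorted(shortlist, key=fusion_score, reverse=True)
```

With this pattern the expensive scorer runs `k` times per query rather than once per item in the collection, which is where the efficiency gain would come from. The accuracy cost depends on how often the true match falls outside the shortlist.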

Fine-Grained Cross-Modal Retrieval with Triple-Streamed Memory Fusion Transformer Encoder

Embrace Smaller Attention: Efficient Cross-Modal Matching with Dual Gated Attention Fusion

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Coarse-to-fine dual-level attention for video-text cross modal retrieval

Feature Fusion Based on Transformer for Cross-modal Retrieval

Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval

Iterative graph attention memory network for cross-modal retrieval

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing

Two-Stream Video Classification with Cross-Modality Attention

Multimodal Fusion Method Based on Self-Attention Mechanism

Research on Video Retrieval Technology based on Multimodal Fusion and Attention Mechanism

Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Heterogeneous memory enhanced graph reasoning network for cross-modal retrieval

Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Cascaded Multi-3D-view Fusion for 3D-Oriented Object Detection

Cross‐modal retrieval with dual multi‐angle self‐attention

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

MCFusion: infrared and visible image fusion based multiscale receptive field and cross-modal enhanced attention mechanism

CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach

MAVEN: A Memory Augmented Recurrent Approach for Multimodal Fusion