Abstract: Remote sensing (RS) image-text retrieval (RSITR) aims to retrieve relevant texts (RS images) based on the content of a given RS image (text). Existing methods typically employ a convolutional neural network (CNN) and a recurrent neural network (RNN) as encoders to learn visual and textual features for retrieval. Although feasible, these encoders pay too little attention to the global information hidden in the different data. To mitigate this problem, Transformers have been introduced. Nevertheless, the complexity of RS images presents challenges in directly introducing Transformer-based architectures to multimodal learning in RS scenes, particularly in visual feature extraction and cross-modal interaction. In addition, textual captions are generally much simpler than the complex RS images they describe, so the same semantic description can apply to different images. This typical false-negative (FN) sample problem increases the difficulty of RSITR tasks. To address the above limitations, we propose a new RSITR model named prior-experience-based RS vision-language (PERSVL). First, dedicated visual and text encoders extract features from RS images and texts. In addition, a high-level feature complement (HFC) module, based on the self-attention mechanism (SAM), is developed for the visual encoder to fully explore the complex contents of RS images. Second, a dual-branch multimodal fusion encoder (DBMFE) is designed to complete the cross-modal learning. It comprises a dual-branch multimodal interaction (DBMI) module and a branch fusion module. DBMI is designed to fully explore the relationships between different modalities, enriching visual and textual features. The branch fusion module integrates the cross-modal features and uses a classification head to generate matching scores for retrieval. Finally, a learning from prior experiences (LPEs) module is designed to reduce the influence of FN samples by analyzing the historical data produced during model training. Experiments on three popular datasets show that our PERSVL model achieves superior performance compared with previous methods. By integrating the advantages of natural language and RS images, PERSVL can be applied in areas such as environmental monitoring, disaster evaluation, and urban planning. Our source code is available at: https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/PERSVL.
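
The abstract does not specify how historical training data is used to reduce the influence of FN samples, so the following is only a minimal sketch of one plausible reading, not the authors' LPEs module. It assumes that matching scores lie in [0, 1], that a running mean of past scores is kept per image-text pair, and that a nominal negative with a high past mean is down-weighted in the loss. The names `ScoreHistory` and `negative_weight` and the `threshold` parameter are hypothetical.

```python
class ScoreHistory:
    """Running mean of matching scores per (image_id, text_id) pair.

    Hypothetical helper: it stands in for the "historical data produced
    during model training" mentioned in the abstract.
    """

    def __init__(self):
        self._sum = {}
        self._count = {}

    def update(self, image_id, text_id, score):
        """Record one matching score observed for this pair."""
        key = (image_id, text_id)
        self._sum[key] = self._sum.get(key, 0.0) + score
        self._count[key] = self._count.get(key, 0) + 1

    def mean(self, image_id, text_id):
        """Return the mean past score, or 0.0 for a pair never seen."""
        key = (image_id, text_id)
        count = self._count.get(key, 0)
        if count == 0:
            return 0.0
        return self._sum[key] / count


def negative_weight(history, image_id, text_id, threshold=0.8):
    """Loss weight for a nominal negative pair (assumed rule).

    Pairs whose past mean score is at or below `threshold` keep weight 1.
    Above it, the weight shrinks linearly to 0 as the past mean approaches 1,
    so likely false negatives contribute less to the loss.
    """
    past = history.mean(image_id, text_id)
    if past <= threshold:
        return 1.0
    return max(0.0, (1.0 - past) / (1.0 - threshold))


# Usage: a caption that repeatedly scored high against a "wrong" image
# is treated as a likely false negative and gets a reduced weight.
history = ScoreHistory()
for score in (0.9, 0.9):
    history.update("img1", "txt9", score)
history.update("img1", "txt2", 0.1)

likely_fn_weight = negative_weight(history, "img1", "txt9")
true_negative_weight = negative_weight(history, "img1", "txt2")
```

In a training loop, such a weight would multiply the per-pair term of whatever matching loss is in use; the linear ramp is an arbitrary choice made here for simplicity.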

A Fusion-Based Contrastive Learning Model for Cross-Modal Remote Sensing Retrieval

Interacting-Enhancing Feature Transformer for Cross-modal Remote Sensing Image and Text Retrieval

Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval

RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing

An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval

A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing

Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

Towards a multimodal framework for remote sensing image change retrieval and captioning

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval

AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval

Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval

Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval

Feature Fusion Based on Transformer for Cross-modal Retrieval