Abstract: Cross-modal remote-sensing image–text retrieval (CMRSITR) is a challenging topic in the remote-sensing (RS) community. It has gained growing attention because it can be flexibly used in many practical applications. In the current deep learning era, with the help of deep convolutional neural networks (DCNNs), many successful CMRSITR methods have been proposed. Most of them first learn features from RS images and texts separately and then map the obtained visual and textual features into a common space for the final retrieval. This pipeline is feasible; however, two difficulties remain. First, the semantics of the visual and textual features are misaligned because the two modalities are learned independently. Second, the deep links between RS images and texts cannot be fully explored by a simple common-space mapping. To overcome these challenges, we propose a new model named interacting-enhancing feature transformer (IEFT) for CMRSITR, which regards the RS images and texts as a whole. First, a simple feature embedding module (FEM) is developed to map images and texts into the visual and textual feature spaces. Second, an information interacting-enhancing module (IIEM) is designed to simultaneously model the inner relationships between RS images and texts and enhance the visual features. IIEM consists of three feature interacting-enhancing (FIE) blocks, each of which contains an intermodality relationship interacting (IMRI) subblock and a visual feature enhancing (VFE) subblock. IMRI exploits the hidden relations between cross-modal data, while VFE improves the visual features. By combining them, semantic bias can be mitigated, and the complex contents of RS images can be studied. Finally, a retrieval module (RM) is constructed to generate the matching scores that decide the search results. Extensive experiments are conducted on four public RS datasets. The positive results demonstrate that our IEFT can achieve superior retrieval performance compared with many existing methods. Our source code is available at https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/IEFT.
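As an illustration of the common-space retrieval step that the abstract attributes to most existing CMRSITR methods, the following is a minimal sketch, not the IEFT implementation. It assumes that image and text features have already been mapped into a shared space and that matching scores are cosine similarities; the function names `cosine` and `rank_images` and the toy three-dimensional embeddings are hypothetical and chosen only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_vec, image_vecs):
    """Score every image against one text query and sort by score.

    Returns (order, scores): `order` lists image indices from best to
    worst match, `scores` holds the matching score of each image.
    """
    scores = [cosine(text_vec, img) for img in image_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy embeddings, assumed to already live in the common space.
image_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
text_vec = [0.5, 0.9, 0.0]

order, scores = rank_images(text_vec, image_vecs)
print(order)
```

In this baseline the two modalities meet only at the final similarity computation, which is the limitation the abstract points to when it motivates modeling image–text relationships earlier, inside the IIEM.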

Multiscale Salient Alignment Learning for Remote-Sensing Image–Text Retrieval

Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval

Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval

Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval

SIRS: Multitask Joint Learning for Remote Sensing Foreground-Entity Image–Text Retrieval

Integrating Multisubspace Joint Learning With Multilevel Guidance for Cross-Modal Retrieval of Remote Sensing Images

Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval

Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

Visual Global-Salient-Guided Network for Remote Sensing Image-Text Retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Spatial–Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval

Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information

A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval

A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing

Scale-Aware Adaptive Refinement and Cross-Interaction for Remote Sensing Audio-Visual Cross-Modal Retrieval

Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval

Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

Interacting-Enhancing Feature Transformer for Cross-modal Remote Sensing Image and Text Retrieval

RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval

An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval