Abstract: Improving cross-modal image–text retrieval (CIR) depends on finer-grained modeling of homogeneous features between modalities. However, in remote sensing (RS) scenarios, existing methods usually align features at image–sentence granularity, which makes fine-grained representation of these homogeneous features difficult. In addition, complex background noise and the extreme scale range of foreground targets are hard to distinguish, causing the feature mottle problem. To address these issues, we propose a novel Semantic-guided Image–text Retrieval framework with Segmentation (SIRS). It is a plug-and-play multitask joint learning framework for efficient end-to-end training of RS CIR models, and it comprises semantic-guided spatial attention (SSA) and adaptive multiscale weighting (AMW) modules. First, SSA introduces a background reconstruction (BR) branch based on noise perception and a semantic segmentation (SS) branch based on pixel-level prediction; their joint learning strategy filters background noise and refines foreground features. Second, AMW performs multiscale weighting on the feature maps output by different layers of the encoder, improving the learning efficiency for foreground targets at different scales. Notably, SIRS outputs combined results consisting of the image and its segmentation mask, which other methods do not provide. To verify the proposed method, we complete semantic segmentation annotation of the RSITMD dataset, producing RSITMD-SS. Extensive experiments verify its effectiveness. With SIRS, mainstream SVP and CLIP-based methods improve by about 7 mR and can optionally produce segmentation predictions at acceptable computational cost. The code and associated dataset will be available at https://github.com/StarBurstStream0/SIRS.
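
The reported gain is stated in mR, which the abstract does not define. In RS image–text retrieval work, mR commonly denotes the mean of R@1, R@5, and R@10 over both retrieval directions (image-to-text and text-to-image). The sketch below computes it under that assumption; the function names and the rank values are hypothetical and are not taken from the SIRS code or results.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose first correct match is ranked in the top k.

    `ranks` holds 1-based ranks of the first correct match for each query.
    """
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def mean_recall(i2t_ranks, t2i_ranks, ks=(1, 5, 10)):
    """Average R@k over both retrieval directions (assumed definition of mR)."""
    scores = [
        recall_at_k(ranks, k)
        for ranks in (i2t_ranks, t2i_ranks)
        for k in ks
    ]
    return sum(scores) / len(scores)


# Hypothetical ranks for four queries in each direction (illustration only).
i2t_ranks = [1, 3, 12, 7]   # image-to-text
t2i_ranks = [2, 1, 6, 30]   # text-to-image

print(mean_recall(i2t_ranks, t2i_ranks))  # prints 50.0
```

Under this reading, an improvement of about 7 mR means the six recall values rise by about 7 percentage points on average.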

Integrating Multi-subspace Joint Learning with Multi-level Guidance for Cross-Modal Retrieval of Remote Sensing Images

A Jointly Guided Deep Network for Fine-Grained Cross-Modal Remote Sensing Text–Image Retrieval

Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval

Multiscale Salient Alignment Learning for Remote-Sensing Image–Text Retrieval

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing

SIRS: Multi-task Joint Learning for Remote Sensing Foreground-entity Image-text Retrieval

Dual Modality Collaborative Learning for Cross-Source Remote Sensing Retrieval

Visual Global-Salient-Guided Network for Remote Sensing Image-Text Retrieval

MCRN: A Multi-source Cross-modal Retrieval Network for Remote Sensing

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information

Multi-Attention Fusion and Fine-Grained Alignment for Bidirectional Image-Sentence Retrieval in Remote Sensing

A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing

Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval

Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks

Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval

Towards Learning a Semantic-Consistent Subspace for Cross-Modal Retrieval

More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification