Abstract: Improving cross-modal image–text retrieval (CIR) depends on finer-grained modeling of homogeneous features between modalities. In remote sensing (RS) scenarios, however, existing methods usually align features at image–sentence granularity, which makes fine-grained representation of homogeneous features between modalities difficult. In addition, complex background noise and the extreme scale range of foreground targets are hard to distinguish, causing the feature mottle problem. To address these issues, we propose a novel Semantic-guided Image–text Retrieval framework with Segmentation (SIRS). SIRS is a plug-and-play multitask joint learning framework for efficient end-to-end training of RS CIR models, and it comprises semantic-guided spatial attention (SSA) and adaptive multiscale weighting (AMW) modules. First, SSA introduces a background reconstruction (BR) branch based on noise perception and a semantic segmentation (SS) branch based on pixel-level prediction; its joint learning strategy filters background noise and refines foreground features. Second, AMW applies multiscale weighting to the feature maps output by different encoder layers, improving the learning efficiency for foreground targets at different scales. Notably, SIRS outputs combined results consisting of an image and a segmentation mask, which other methods do not provide. Building on the RSITMD dataset, we complete the SS annotation to form RSITMD-SS and use it to verify the performance of the proposed method. Extensive experiments verify its effectiveness. With SIRS, the mainstream SVP and CLIP-based methods improve by about 7 mR and can optionally produce segmentation predictions at acceptable computational cost. The code and associated dataset will be available at https://github.com/StarBurstStream0/SIRS.
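The abstract reports gains in mR without defining it. The sketch below shows how such a score could be computed, assuming mR follows the convention common in RS image–text retrieval: the mean of R@1, R@5, and R@10 over both retrieval directions. That definition, the function names `recall_at_k` and `mean_recall`, and the toy similarity matrices are illustrative assumptions and are not taken from the SIRS code.

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries with at least one correct item in the top k.

    sim: list of rows, one row of similarity scores per query.
    gt:  list of sets, the correct candidate indices for each query.
    """
    hits = 0
    for scores, relevant in zip(sim, gt):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in relevant for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


def mean_recall(sim_i2t, gt_i2t, sim_t2i, gt_t2i, ks=(1, 5, 10)):
    """Assumed mR: average of R@k over both directions and all k in ks."""
    recalls = [recall_at_k(sim_i2t, gt_i2t, k) for k in ks]
    recalls += [recall_at_k(sim_t2i, gt_t2i, k) for k in ks]
    return sum(recalls) / len(recalls)


# Toy example (hypothetical scores): 2 images, 2 captions.
sim_i2t = [[0.9, 0.1], [0.8, 0.2]]   # image -> text similarities
sim_t2i = [[0.9, 0.8], [0.1, 0.2]]   # text -> image similarities (transpose)
gt = [{0}, {1}]                      # image i matches caption i
mr = mean_recall(sim_i2t, gt, sim_t2i, gt)
```

Under this reading, an improvement of "about 7 mR" would mean the six recall values rise by roughly 7 percentage points on average.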

Semantic-Spatial Collaborative Perception Network for Remote Sensing Image Captioning

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

SSCNet: A Spectrum-Space Collaborative Network for Semantic Segmentation of Remote Sensing Images

A Spectral–Spatial Context-Boosted Network for Semantic Segmentation of Remote Sensing Images

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

SIRS: Multitask Joint Learning for Remote Sensing Foreground-Entity Image–Text Retrieval

DCP-Net: A Distributed Collaborative Perception Network for Remote Sensing Semantic Segmentation

SCAttNet: Semantic Segmentation Network with Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images

Multi-Content Complementation Network for Salient Object Detection in Optical Remote Sensing Images

Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Semantic Attention and Scale Complementary Network for Instance Segmentation in Remote Sensing Images

Exploring Models and Data for Remote Sensing Image Caption Generation

A New Semantic Segmentation Method for Remote Sensing Images Integrating Coordinate Attention and SPD-Conv

Progressive Scale-aware Network for Remote sensing Image Change Captioning

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

Global Semantic-Sense Aggregation Network for Salient Object Detection in Remote Sensing Images