SIRS: Multitask Joint Learning for Remote Sensing Foreground-Entity Image–Text Retrieval
Zicong Zhu, Jian Kang, Wenhui Diao, Yingchao Feng, Junxi Li, Jingen Ni
DOI: https://doi.org/10.1109/tgrs.2024.3402216
IF: 8.2
2024-06-04
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Improving cross-modal image–text retrieval (CIR) depends on finer-grained modeling of homogeneous features between modalities. In remote sensing (RS) scenarios, however, existing methods usually align features at image–sentence granularity, which makes fine-grained representation of homogeneous features between modalities difficult. In addition, RS images contain more complex background noise and foreground targets with extreme scale ranges that are hard to distinguish, causing the feature mottle problem. To address these issues, we propose a novel Semantic-guided Image–text Retrieval framework with Segmentation (SIRS). It is a plug-and-play multitask joint learning framework for efficient end-to-end training of RS CIR models, and it comprises semantic-guided spatial attention (SSA) and adaptive multiscale weighting (AMW) modules. First, SSA introduces a background reconstruction (BR) branch based on noise perception and a semantic segmentation (SS) branch based on pixel-level prediction. It explores a joint learning strategy that filters background noise and considerably refines foreground features. Second, AMW performs multiscale weighting on the feature maps output by different encoder layers, effectively improving the learning efficiency for foreground targets at different scales. Notably, SIRS outputs combined results consisting of an image and a segmentation mask, which other methods do not provide. To verify the performance of the proposed method, we add SS annotations to the RSITMD dataset, yielding RSITMD-SS. Extensive experiments verify the effectiveness of the proposed method. With SIRS, mainstream SVP and CLIP-based methods improve by about 7 mR and can optionally produce segmentation predictions at an acceptable computational cost. The code and associated dataset will be available at https://github.com/StarBurstStream0/SIRS.
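The abstract reports gains in mR without defining it. A minimal sketch of the metric follows, assuming the convention common in RS image–text retrieval work: mR is the mean of R@1, R@5, and R@10 over both retrieval directions (image-to-text and text-to-image). The function names and the toy rankings are hypothetical and are not taken from the SIRS code.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[Sequence[int]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k."""
    hits = sum(
        1
        for ranking, targets in zip(ranked, relevant)
        if any(item in targets for item in ranking[:k])
    )
    return 100.0 * hits / len(ranked)


def mean_recall(i2t_ranked: Sequence[Sequence[int]],
                i2t_relevant: Sequence[Set[int]],
                t2i_ranked: Sequence[Sequence[int]],
                t2i_relevant: Sequence[Set[int]],
                ks: Sequence[int] = (1, 5, 10)) -> float:
    """Assumed definition of mR: average of R@k over both directions."""
    scores = []
    for ranked, relevant in ((i2t_ranked, i2t_relevant),
                             (t2i_ranked, t2i_relevant)):
        for k in ks:
            scores.append(recall_at_k(ranked, relevant, k))
    return sum(scores) / len(scores)


# Toy example (made-up rankings): two image queries and two text queries.
i2t_ranked = [[0, 1, 2], [2, 0, 1]]   # ranked caption ids per image query
i2t_relevant = [{0}, {1}]             # ground-truth caption ids
t2i_ranked = [[1, 0], [1, 0]]         # ranked image ids per text query
t2i_relevant = [{0}, {1}]             # ground-truth image ids

mr = mean_recall(i2t_ranked, i2t_relevant, t2i_ranked, t2i_relevant)
```

Under this assumed definition, an improvement of about 7 mR means the average of the six recall values rises by about 7 percentage points.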
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics