Multiscale Salient Alignment Learning for Remote-Sensing Image–Text Retrieval

Yaxiong Chen, Jinghao Huang, Xiaoyu Li, Shengwu Xiong, Xiaoqiang Lu
DOI: https://doi.org/10.1109/tgrs.2023.3340870
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Remote-sensing image–text (RSIT) retrieval uses either textual descriptions or remote-sensing images (RSIs) as queries to retrieve relevant RSIs or corresponding text descriptions. Many traditional cross-modal RSIT retrieval methods overlook the importance of capturing salient information and establishing the prior similarity between RSIs and texts, which degrades cross-modal retrieval performance. In this article, we address these challenges by introducing a novel approach known as multiscale salient image-guided text alignment (MSITA). This approach is designed to learn salient information by aligning text with images for effective cross-modal RSIT retrieval. The MSITA approach first incorporates a multiscale fusion module and a salient learning module to facilitate the extraction of salient information. In addition, it introduces an image-guided text alignment (IGTA) mechanism that uses image information to guide the alignment of texts, enabling the effective capture of fine-grained correspondences between RSI regions and textual descriptions. Finally, a novel loss function is devised to enhance the similarity across different modalities and reinforce the prior similarity between RSIs and texts. Extensive experiments conducted on four widely adopted RSIT datasets affirm that the MSITA approach significantly enhances cross-modal RSIT retrieval performance in comparison to other state-of-the-art methods.