Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval

Gang Hu, Zaidao Wen, Yafei Lv, Jianting Zhang, Qian Wu
DOI: https://doi.org/10.1109/tgrs.2024.3401031
IF: 8.2
2024-05-25
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Cross-modal remote-sensing image–text retrieval (CMRSITR) is a challenging task that aims to retrieve target remote-sensing (RS) images based on textual descriptions. However, the modal gap between texts and RS images poses a significant challenge. RS images comprise multiple targets and complex backgrounds, necessitating the mining of both global and local information (GaLR) for effective CMRSITR. Existing approaches primarily focus on local image features while disregarding the local features of the text and their correspondence. These methods typically fuse global and local image features and align them with global text features. However, they struggle to eliminate the influence of cluttered backgrounds and may overlook crucial targets. To address these limitations, we propose a novel framework for CMRSITR based on a transformer architecture, which leverages global–local information soft alignment (GLISA) to enhance retrieval performance. Our framework incorporates a global image extraction module, which captures the global semantic features of image–text pairs and effectively represents the relationships among multiple targets in RS images. In addition, we introduce an adaptive local information extraction (ALIE) module that adaptively mines discriminative local clues from both RS images and texts, aligning the corresponding fine-grained information. To mitigate semantic ambiguities during the alignment of local features, we design a local information soft-alignment (LISA) module. In comparative evaluations using two public CMRSITR datasets, our proposed method achieves state-of-the-art results, surpassing not only traditional cross-modal retrieval methods by a substantial margin but also other contrastive language-image pretraining (CLIP)-based methods.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
What problem does this paper attempt to address?