Visual Global-Salient-Guided Network for Remote Sensing Image-Text Retrieval

Yangpeng He, Xin Xu, Hongjia Chen, Jinwen Li, Fangling Pu
DOI: https://doi.org/10.1109/tgrs.2024.3466389
IF: 8.2
2024-10-08
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Amid the rapid evolution of remote sensing (RS) technology, RS cross-modal text-image retrieval (RSCTIR) has attracted scholarly interest for its strong adaptability and close interaction with human operators. However, the heterogeneity between the image and text modalities makes feature alignment a significant challenge. Existing methods do not sufficiently incorporate structural guidance into the cross-modal feature interaction process to foster alignment between text and image features. In light of this, we propose an approach for the RS image-text retrieval task called the visual global-salient-guided network (VGSGN), which comprises two branches: the image branch and the text branch. In the image branch, a visual global-salient information sensing module (VGSM) is devised to extract visual global and salient features, aiming to enhance the perception of complex backgrounds and scenes in RS images. In the text branch, a textual graph enhancement module (TGEM) is designed to filter out redundant information in the text features and capture the interactions between words within the text. The multiple visual-guided dynamic fusion (MVGF) module leverages the global and salient features of the image to guide the text features, facilitating cross-modal alignment of text and image features. Experimental results on the widely used RSICD and RSITMD datasets corroborate the effectiveness of the proposed VGSGN on the RSCTIR task.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics