Visual Contextual Semantic Reasoning for Cross-Modal Drone Image-Text Retrieval

Jinghao Huang, Yaxiong Chen, Shengwu Xiong, Xiaoqiang Lu
DOI: https://doi.org/10.1109/tgrs.2024.3443197
IF: 8.2
2024-09-02
IEEE Transactions on Geoscience and Remote Sensing
Abstract: The cross-modal drone image-text (DIT) retrieval task uses either text or a drone image as the query to retrieve relevant drone images or the corresponding text. The primary challenge stems from the diverse and intricate nature of drone images, which makes it difficult to align image and text effectively. In response, we propose an approach called visual contextual semantic reasoning (VCSR), aimed at precisely aligning information across the two modalities. VCSR employs textual cues to guide semantic reasoning within the visual context, reducing redundancy in the visual information. Furthermore, the method captures the drone image information relevant to the text, revealing subtle correspondences between drone image regions and textual content. To enhance visual semantic learning, a context region learning (CRL) term and a consistency semantic alignment (CSA) term are introduced for stronger guidance. These terms further intensify the cross-modal interaction between textual and visual data, resulting in more robust feature representations. Extensive experiments on two self-constructed DIT datasets demonstrate that VCSR outperforms alternative methods in DIT retrieval performance. The code is available at https://github.com/huangjh98/VCSR.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
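
The bidirectional retrieval task described in the abstract (text query to drone image, and drone image query to text) can be illustrated with a minimal sketch. This is not the VCSR method: it assumes that image and text embeddings already live in a shared space and simply ranks candidates by cosine similarity. The function names and the toy embeddings below are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, gallery, k=1):
    """Return indices of the k gallery embeddings most similar to the query."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy, hand-made embeddings (hypothetical; a real system would learn them).
image_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

# Text-to-image: use caption 1 as the query over the image gallery.
text_to_image = retrieve(text_embeddings[1], image_embeddings, k=1)

# Image-to-text: use image 2 as the query over the caption gallery.
image_to_text = retrieve(image_embeddings[2], text_embeddings, k=1)
```

In this sketch the same ranking function serves both query directions; per the abstract, VCSR's contribution lies in how the aligned representations are learned, which this example does not attempt to reproduce.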