Injecting Linguistic into Visual Backbone: Query-Aware Multimodal Fusion Network for Remote Sensing Visual Grounding

Chongyang Li, Wenkai Zhang, Hanbo Bi, Jihao Li, Shuoke Li, Haichen Yu, Xian Sun, Hongqi Wang
DOI: https://doi.org/10.1109/tgrs.2024.3450303
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: The Remote Sensing Visual Grounding (RSVG) task focuses on accurately identifying and localizing specific targets in remote sensing (RS) images using descriptive query expressions. Existing methods independently extract visual and textual features, ignoring early complementary information between image and text. This leads to information loss and misalignment, limiting the model’s ability to distinguish similar targets. To address this challenge, we propose the Query-Aware Multimodal Fusion Network (QAMFN), which introduces an innovative Query-Guided Visual Attention (QGVA) mechanism in the early stages of the visual encoder. This mechanism integrates textual information during the early visual feature extraction process, thereby resolving the issue of missing image-text complementary information. QGVA ensures that the visual backbone accurately focuses on local features highly relevant to the query by injecting textual information into the visual encoding process. Additionally, to enhance the model’s ability to integrate multimodal information and adapt to more complex RS images, we introduce the Text-Semantic Attention-Guided Masking (TAM) module. TAM aggregates multimodal features processed by the backbones and filters out redundant information, producing high-quality fused features. Experiments demonstrate that our approach sets a new record on the DIOR-RSVG dataset, improving accuracy to 81.67% (an absolute increase of 4.98%).
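The idea of injecting textual information into the visual encoder can be illustrated with a generic cross-attention step, in which each visual token attends over the query's text tokens and adds the resulting text context back to itself. The sketch below is an assumption-laden illustration only: the abstract does not specify how QGVA is computed, so the function name `query_guided_attention`, the absence of learned projections, the residual addition, and the toy two-dimensional tokens are all hypothetical and are not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_guided_attention(visual_tokens, text_tokens):
    """Hypothetical text-to-visual injection via plain cross-attention.

    Each visual token acts as the attention query; the text tokens act as
    keys and values. No learned projections are used (an assumption made
    to keep the sketch short).
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, all_weights = [], []
    for v in visual_tokens:
        # Similarity of this visual token to every text token.
        weights = softmax([dot(v, t) * scale for t in text_tokens])
        # Weighted sum of text tokens: the text context for this location.
        context = [
            sum(w * t[d] for w, t in zip(weights, text_tokens))
            for d in range(dim)
        ]
        # Residual add: the visual feature is kept and enriched with text.
        fused.append([vd + cd for vd, cd in zip(v, context)])
        all_weights.append(weights)
    return fused, all_weights


# Toy inputs: two visual tokens and two text tokens in a 2-D feature space.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = query_guided_attention(visual, text)
```

In this toy setting each visual token places more attention weight on the text token it is aligned with, so the fused feature is pulled toward the query-relevant direction. A real early-fusion backbone would add learned query, key, and value projections and apply the step inside several encoder stages.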
What problem does this paper attempt to address?