Enhancing Remote Sensing Visual Question Answering: A Mask-Based Dual-Stream Feature Mutual Attention Network
Yangyang Li, Yunfei Ma, Guangyuan Liu, Qiang Wei, Yanqiao Chen, Ronghua Shang, Licheng Jiao
DOI: https://doi.org/10.1109/lgrs.2024.3389042
IF: 5.343
2024-04-27
IEEE Geoscience and Remote Sensing Letters
Abstract: Visual question answering (VQA) applied to remote sensing images (RSIs) enables interaction between image information and text information, which lowers the professional barriers across different RSI processing fields. Current methods struggle both to make full use of the global and local information in the image when interacting with the question information and to address interclass interference. To address these challenges, this letter proposes a mask-based dual-stream feature mutual attention network (MADNet) for remote sensing visual question answering (RSVQA). First, a dual-stream image feature extraction module obtains the image features, and a deep and shallow layer feature encoding module obtains the question features. Second, an attention mechanism is introduced and combined with pointwise multiplication to fuse the dual-stream features extracted in the first step. Finally, an answer relevance modulation module based on a binary mask vector filters out irrelevant answers. In the experiments, the proposed strategy is evaluated on two datasets collected by aerial and Sentinel-2 sensors. The proposed model outperforms previous approaches, achieving a 6.89% increase in overall accuracy (OA) over the baseline. This improvement persists even when the training data are reduced by half, as shown by the experiments on the low-resolution (LR) dataset.
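The abstract does not give the details of the answer relevance modulation module, so the following is only a minimal sketch of one common way a binary mask vector can filter answers: candidates that do not fit the question type are excluded before the softmax. The answer vocabulary, the question types, and the names `build_mask`, `masked_softmax`, and `predict` are all hypothetical and are not taken from the letter.

```python
import math

# Hypothetical answer vocabulary and question types (assumptions, not from the letter).
ANSWERS = ["yes", "no", "0", "1", "2", "rural", "urban"]
TYPE_TO_ANSWERS = {
    "presence": {"yes", "no"},
    "count": {"0", "1", "2"},
    "rural_urban": {"rural", "urban"},
}


def build_mask(question_type):
    """Binary vector: 1 where the answer is valid for this question type, else 0."""
    valid = TYPE_TO_ANSWERS[question_type]
    return [1 if answer in valid else 0 for answer in ANSWERS]


def masked_softmax(logits, mask):
    """Softmax restricted to entries with mask == 1; masked entries get probability 0."""
    kept = [logit for logit, m in zip(logits, mask) if m]
    if not kept:
        raise ValueError("mask removes every answer")
    top = max(kept)  # subtract the max of the kept logits for numerical stability
    exps = [math.exp(logit - top) if m else 0.0 for logit, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, question_type):
    """Return the most probable answer after masking, plus the full distribution."""
    probs = masked_softmax(logits, build_mask(question_type))
    best = max(range(len(ANSWERS)), key=probs.__getitem__)
    return ANSWERS[best], probs


if __name__ == "__main__":
    # The raw logits favour "0", but a presence question only admits "yes" or "no".
    logits = [0.2, 0.1, 3.0, 0.5, 0.4, 2.0, 1.0]
    answer, probs = predict(logits, "presence")
    print(answer, [round(p, 3) for p in probs])
```

In this sketch the mask acts at inference time on the classifier logits. Whether MADNet applies its mask to logits, to features, or during training is not stated in the abstract.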
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics