Alignment and Multimodal Reasoning for Remote Sensing Visual Question Answering

Yumin Tian, Haojie Xu, Di Wang, Ke Li, Lin Zhao
DOI: https://doi.org/10.1109/igarss53475.2024.10641340
2024-01-01
Abstract: Visual question answering for remote sensing data (RSVQA) has recently emerged as a prominent research area in remote sensing. Transformer-based approaches have demonstrated impressive results, attributed to their strength in jointly modeling the visual and textual modalities. However, existing RSVQA methods often overlook the modality biases present in visual-language interactions, which leads to inaccurate answers. To address this issue, we propose a novel Transformer-based approach aimed at mitigating modality biases in RSVQA. Specifically, we introduce a contrastive learning loss that aligns image and text representations before cross-modal fusion, facilitating foundational learning of visual and language representations. We then design a cross-modal decoder to capture the correlations between images and text. In addition to predicting answers to questions, we incorporate an extra head for regression prediction of question types. Experimental results demonstrate that our approach achieves higher answer-prediction accuracy than state-of-the-art (SoTA) methods.
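The abstract does not give the exact form of the contrastive learning loss, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE-style image-text loss computed on pooled embeddings before fusion. The function names, the temperature value of 0.07, and the use of plain Python lists are assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not confirmed by the paper).

    image_emb[k] and text_emb[k] are treated as a matched pair; every other
    pairing in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)

    # Pairwise cosine similarities, scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]

    # Image-to-text direction: each row should peak on its own diagonal entry.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation over the columns.
    cols = [[sim[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)
```

Under these assumptions, a batch whose image and text embeddings point in matching directions yields a loss near zero, while a batch with swapped pairs yields a large loss, which is the pressure that pulls the two modalities into a shared space before the cross-modal decoder sees them.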