Hierarchical Multi-Modality Graph Reasoning for Remote Sensing Visual Question Answering
Han Zhang, Keming Wang, Laixian Zhang, Bingshu Wang, Xuelong Li
DOI: https://doi.org/10.1109/tgrs.2024.3502800
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Remote sensing visual question answering (RSVQA) aims to answer natural-language questions about remote sensing (RS) images. RSVQA in real-world applications is challenging because it can involve wide-field visual information and complicated queries. Current RSVQA methods overlook the semantic hierarchy of visual and linguistic information and ignore the complex relations among multi-modal instances. As a result, they suffer from serious deficiencies in comprehensively representing and associating vision-language semantics. In this work, we design an end-to-end model, the Hierarchical Multi-modality Graph Reasoning (HMGR) network, which hierarchically learns multi-granular vision-language joint representations and interactively parses heterogeneous multi-modal relationships. Specifically, we design a hierarchical vision-language encoder that simultaneously represents multi-scale vision features and multi-level language features. Based on these representations, vision-language semantic graphs are built and parallel multi-modal graph relation reasoning is performed, which explores the complex interaction patterns and implicit semantic relations of both intra-modality and inter-modality instances. Moreover, we propose a distinctive vision-question (VQ) feature fusion module for combining information at different semantic levels. Extensive experiments on three public large-scale datasets (RSVQA-LR, RSVQA-HRv1, and RSVQA-HRv2) demonstrate that our method outperforms state-of-the-art approaches across a wide range of vision and query types.
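The abstract does not give the equations behind the graph relation reasoning or the VQ fusion, so the following is only an illustrative sketch of the general idea, not the HMGR implementation. It assumes that visual and question instances are already encoded as node feature matrices, uses scaled dot-product attention as a stand-in for relation reasoning, and fuses the two modalities by mean pooling and concatenation. The names `relate` and `reason_and_fuse`, the feature sizes, and the averaging of the two branches are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relate(queries, keys):
    """One attention-style message-passing step (assumed, not from the paper).

    Each query node aggregates the key nodes, weighted by a soft adjacency
    built from scaled dot-product affinity, with a residual connection.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)  # rows sum to 1
    return queries + weights @ keys

def reason_and_fuse(vis_nodes, txt_nodes):
    # Intra-modality reasoning: nodes attend within their own graph.
    vis_intra = relate(vis_nodes, vis_nodes)
    txt_intra = relate(txt_nodes, txt_nodes)
    # Inter-modality reasoning: nodes attend to the other modality's graph.
    vis_inter = relate(vis_nodes, txt_nodes)
    txt_inter = relate(txt_nodes, vis_nodes)
    # The two branches are computed independently and then averaged
    # (a hypothetical way to combine the parallel branches).
    vis = (vis_intra + vis_inter) / 2
    txt = (txt_intra + txt_inter) / 2
    # Hypothetical fusion: mean-pool each graph and concatenate.
    return np.concatenate([vis.mean(axis=0), txt.mean(axis=0)])

rng = np.random.default_rng(0)
vis_nodes = rng.normal(size=(6, 8))  # 6 visual nodes, 8-dim features (made up)
txt_nodes = rng.normal(size=(4, 8))  # 4 question nodes, 8-dim features (made up)
fused = reason_and_fuse(vis_nodes, txt_nodes)
print(fused.shape)  # (16,)
```

In a trained model the affinities would come from learned projections, and the fused vector would feed an answer classifier; both are omitted here to keep the sketch minimal.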