Discovering Multimodal Hierarchical Structures with Graph Neural Networks for Multi-modal and Multi-hop Question Answering.

Qing Zhang, Haocheng Lv, Jie Liu, Zhiyun Chen, Jianyong Duan, Mingying Xv, Hao Wang
DOI: https://doi.org/10.1007/978-981-99-8429-9_31
2024-01-01
Abstract: Multimodal reasoning is a challenging task that requires understanding and integrating information from different modalities, such as text and images. Existing methods for multimodal reasoning often fail to capture the rich structural information among visual and textual semantics across modalities, which is crucial for generating accurate answers. In this paper, we propose a novel method that uses graph neural networks to model this structural information and enhance multimodal reasoning. Specifically, we first use a Multimodal and Multi-hop reader to attend to different chunks in the context based on the question and then search for multi-hop candidate tokens within these chunks. Next, we construct a graph to represent the relations among the chunks and apply a Sparse Matrix-Tree algorithm to learn a hierarchical informative structure. We then use a Hierarchy-aware Message Passing mechanism to perform multi-hop reasoning on the selected edges and update the node representations. Finally, a graph-selection decoder generates the answer from the structure-enriched chunk representations. We conduct experiments on WebQA, a large-scale multimodal question answering dataset [1]. The results show that our method outperforms the baseline methods in both reasoning and overall answer accuracy. We also provide qualitative analysis illustrating how our method benefits from the structural information among different modalities.
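The abstract does not spell out the Sparse Matrix-Tree algorithm, so the following is only a minimal sketch of the standard (dense) Matrix-Tree theorem computation that structure-learning methods of this kind typically build on. It is not the authors' implementation. It takes pairwise scores between chunk nodes and returns the marginal probability of each directed edge under a distribution over single-rooted spanning trees. The function name `matrix_tree_marginals` and its inputs `edge_scores` and `root_scores` are hypothetical and introduced here for illustration.

```python
import numpy as np


def matrix_tree_marginals(edge_scores, root_scores):
    """Edge and root marginals over single-rooted directed spanning trees.

    Illustrative sketch of the dense Matrix-Tree theorem, not the paper's
    sparse variant. edge_scores[i, j] is an unnormalized log-score for
    node i being the parent of node j; root_scores[i] is the log-score
    for node i being the root.
    """
    n = edge_scores.shape[0]
    # Non-negative edge weights with self-loops removed.
    A = np.exp(edge_scores) * (1.0 - np.eye(n))
    r = np.exp(root_scores)

    # Graph Laplacian: incoming weight on the diagonal, -A off the diagonal.
    L = np.diag(A.sum(axis=0)) - A
    # Replace the first row with the root weights; det(L_bar) is then the
    # total weight of all single-rooted spanning trees.
    L_bar = L.copy()
    L_bar[0, :] = r
    L_inv = np.linalg.inv(L_bar)

    # P(i -> j) = [j != 0] * A_ij * Linv_jj - [i != 0] * A_ij * Linv_ji
    term1 = A * np.diag(L_inv)[None, :]
    term1[:, 0] = 0.0
    term2 = A * L_inv.T
    term2[0, :] = 0.0
    edge_marginals = term1 - term2

    # P(node i is the root) = r_i * Linv_i0
    root_marginals = r * L_inv[:, 0]
    return edge_marginals, root_marginals


# Example call on random scores for four chunk nodes.
rng = np.random.default_rng(0)
edge_marg, root_marg = matrix_tree_marginals(
    rng.normal(size=(4, 4)), rng.normal(size=4)
)
```

In a pipeline like the one described, marginals of this kind could serve as soft edge weights for the subsequent message-passing step. How the paper sparsifies or selects edges is not stated in the abstract, so that part is left open here.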