Learning Hierarchical Reasoning for Text-Based Visual Question Answering

Caiyuan Li, Qinyi Du, Qingqing Wang, Yaohui Jin
DOI: https://doi.org/10.1007/978-3-030-86365-4_25
2021-01-01
Abstract: Text-based visual question answering (TextVQA) requires answering questions based on the objects and text in an image, which involves joint reasoning over three modalities: the question, the visual objects, and the text in the image. Recent approaches to TextVQA treat the three modalities as a joint input to a transformer. However, these implicit reasoning methods do not make full use of the multi-modal information, especially the visual modality. To this end, we propose a novel TextVQA model that reasons explicitly, in a human-like manner. First, the relevance between each object and the question is obtained. Then, the object modality is fused into the text modality, weighted by the obtained relevance. Finally, the amended text modality is used to predict the answer. In contrast to the previous strategy of free multi-modal fusion, our method makes the reasoning process more explicit and robust. Moreover, a prior-based loss is proposed to constrain the object-question relevance. Extensive experiments on several benchmark datasets demonstrate the superior performance of our hierarchical reasoning framework over current state-of-the-art methods.
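
The three reasoning steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how relevance is computed, how fusion is performed, or how answers are decoded. The sketch assumes scaled dot-product relevance, additive fusion of a relevance-weighted object summary into each text (OCR) token, and answer selection by scoring the amended tokens against the question. The function name `hierarchical_reasoning` and all shapes are hypothetical, and the prior-based loss is omitted because the abstract gives no form for it.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def hierarchical_reasoning(question, objects, ocr_tokens):
    """Illustrative three-step pipeline (all modelling choices are assumptions).

    question:   (d,)        pooled question embedding
    objects:    (n_obj, d)  visual object embeddings
    ocr_tokens: (n_ocr, d)  embeddings of text found in the image
    """
    d = question.shape[0]

    # Step 1: relevance between each object and the question
    # (assumed here to be a scaled dot product followed by softmax).
    relevance = softmax(objects @ question / np.sqrt(d))      # (n_obj,)

    # Step 2: fuse the object modality into the text modality,
    # weighted by relevance (assumed here to be additive fusion).
    object_summary = relevance @ objects                      # (d,)
    amended_text = ocr_tokens + object_summary                # (n_ocr, d)

    # Step 3: predict the answer from the amended text modality
    # (assumed here to be a distribution over OCR tokens).
    answer_scores = softmax(amended_text @ question / np.sqrt(d))  # (n_ocr,)

    return relevance, amended_text, answer_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=8)
    objs = rng.normal(size=(5, 8))
    ocr = rng.normal(size=(3, 8))
    rel, amended, scores = hierarchical_reasoning(q, objs, ocr)
    print("most relevant object:", int(rel.argmax()))
    print("predicted OCR token:", int(scores.argmax()))
```

The ordering is the point of the sketch: visual objects influence the answer only through the relevance weights and the amended text features, rather than being mixed freely with the other modalities in a single joint input.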