MERGE: Multi-Entity Relational Reasoning Based Explanation in Visual Question Answering

Mengzhu Pan, Qianmu Li, Tian Qiu
DOI: https://doi.org/10.1109/dasc/picom/cbdcom/cy59711.2023.10361383
2023-01-01
Abstract: To handle VQA tasks in complex scenarios involving multiple entities and to obtain reliable explanations, models need to fully understand the high-level semantic information in visual and textual features. Existing VQA methods rarely explore entity relation features, which leads to insufficient answer prediction accuracy and to generated explanations that are not sufficiently relevant to the image and question. To address this issue, we use visual relational reasoning to enhance the overall understanding of image scenes and improve the accuracy of predicted answers and explanations. Our proposed method, Multi-Entity Relational Reasoning based Explanation (MERGE), constructs action, spatial, and attribute relations among the question-related entities in an image. The contextual visual features are encoded through a graph attention mechanism and fused with question and answer embeddings to generate more accurate textual explanations. To validate the effectiveness of our method, we conducted extensive experiments on seven datasets, including VQA-CP, VQA-X, and CLEVR-X. The results demonstrate improved answer accuracy and high-quality explanations. Furthermore, our results show that using explanations as supervision can quantitatively improve the accuracy of answer prediction.
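The pipeline outlined in the abstract (typed relations among entities, graph attention over them, then fusion with question and answer embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the scaled dot-product scoring, the scalar bias per relation type, the mean pooling, the concatenation-based fusion, and every function and variable name below are assumptions made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relation_graph_attention(features, edges, relation_bias):
    """One attention pass over an entity graph (hypothetical formulation).

    features:      list of entity feature vectors (equal length)
    edges:         dict mapping (i, j) -> relation type, e.g. "action"
    relation_bias: dict mapping relation type -> scalar added to the score
    """
    n = len(features)
    dim = len(features[0])
    updated = []
    for i in range(n):
        # Each entity attends to itself and to the entities it has an edge to.
        neighbors = [i] + [j for j in range(n) if j != i and (i, j) in edges]
        scores = []
        for j in neighbors:
            bias = 0.0 if j == i else relation_bias[edges[(i, j)]]
            scores.append(dot(features[i], features[j]) / math.sqrt(dim) + bias)
        weights = softmax(scores)
        # Contextual feature = attention-weighted sum of neighbor features.
        updated.append(
            [sum(w * features[j][k] for w, j in zip(weights, neighbors))
             for k in range(dim)]
        )
    return updated


def fuse(context, question_vec, answer_vec):
    """Mean-pool entity features and concatenate with the two embeddings
    (an assumed fusion scheme; the abstract does not specify one)."""
    dim = len(context[0])
    pooled = [sum(v[k] for v in context) / len(context) for k in range(dim)]
    return pooled + question_vec + answer_vec


# Toy example: three entities linked by the three relation types named above.
entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relations = {(0, 1): "spatial", (1, 2): "action", (2, 0): "attribute"}
bias = {"action": 0.5, "spatial": 0.0, "attribute": -0.5}

context = relation_graph_attention(entities, relations, bias)
fused = fuse(context, question_vec=[0.2, 0.4], answer_vec=[0.9])
```

In a full model the fused vector would feed an explanation decoder, and the relation biases would be learned rather than fixed by hand.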