Question guided multimodal receptive field reasoning network for fact-based visual question answering
Zicheng Zuo,Yanhan Sun,Zhenfang Zhu,Mei Wu,Hui Zhao
DOI: https://doi.org/10.1007/s11042-024-19387-2
IF: 2.577
2024-05-21
Multimedia Tools and Applications
Abstract: Fact-based visual question answering aims to answer questions about images with the help of external knowledge, but understanding the joint semantics of images and text is more challenging than understanding the semantics of text alone. Most existing methods do not take full advantage of the information contained in the knowledge graph, which makes it difficult to consider the information of all modalities comprehensively during reasoning. In this article, we propose a question guided multi-modal receptive field reasoning network, which expands the receptive field of the image to the text and the knowledge graph. Specifically, our model generates a clue from the image and text at each inference step. These clues are then introduced into the attention operator to perform explicit and implicit reasoning on the knowledge graph. We divide the reasoning process into three steps. First, the image is encoded to identify objects that match the text, and these objects are scored. Second, using the object with the highest score as a clue, the relationships contained in the text are extracted. Finally, the relationships and objects obtained in the first two steps are used as clues to retrieve the matching triplets in the knowledge graph. This approach provides a more precise answer to the question. Without using a pretrained model, our model improves performance on the FVQA and ZSFVQA datasets by 1.74% and 0.41%, respectively. Our research provides a novel method for future multi-modal retrieval using knowledge graphs. Our code is available at https://github.com/ZuoZicheng/QuMuQA
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
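
The three reasoning steps described in the abstract (score objects against the question, use the best object as a clue to pick a relation, then use both clues to retrieve triplets) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the function names, the token-overlap score standing in for the learned attention operator, the restriction of candidate relations to those linked to the object in the knowledge graph, and the example data are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def tokenize(text: str) -> Set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().replace("?", " ").replace("_", " ").split())


def overlap_score(question_tokens: Set[str], phrase: str) -> float:
    """Toy relevance score (assumption): share of phrase tokens found in the question."""
    phrase_tokens = tokenize(phrase)
    if not phrase_tokens:
        return 0.0
    return len(phrase_tokens & question_tokens) / len(phrase_tokens)


def select_object(question: str, detected_objects: List[str]) -> str:
    """Step 1: score detected objects against the question and keep the best one."""
    q = tokenize(question)
    return max(detected_objects, key=lambda obj: overlap_score(q, obj))


def select_relation(question: str, obj: str, kg: List[Triple]) -> Optional[str]:
    """Step 2: with the object as a clue, pick the relation the question mentions."""
    q = tokenize(question)
    # Assumption: only relations attached to the object clue are candidates.
    candidates = sorted({r for h, r, t in kg if obj in (h, t)})
    if not candidates:
        return None
    return max(candidates, key=lambda rel: overlap_score(q, rel))


def retrieve_answers(obj: str, relation: Optional[str], kg: List[Triple]) -> List[str]:
    """Step 3: use both clues to find matching triples; return the other entity."""
    answers = []
    for h, r, t in kg:
        if r != relation:
            continue
        if h == obj:
            answers.append(t)
        elif t == obj:
            answers.append(h)
    return answers


def answer_question(question: str, detected_objects: List[str], kg: List[Triple]) -> List[str]:
    """Run the three steps in sequence, passing each clue to the next step."""
    obj = select_object(question, detected_objects)
    rel = select_relation(question, obj, kg)
    return retrieve_answers(obj, rel, kg)


# Hypothetical toy inputs, not taken from the FVQA dataset.
DETECTED = ["dog", "frisbee", "tree"]
KG = [
    ("dog", "capable_of", "bark"),
    ("dog", "is_a", "animal"),
    ("frisbee", "used_for", "play"),
    ("tree", "is_a", "plant"),
]
QUESTION = "What is the dog in the image capable of?"

print(answer_question(QUESTION, DETECTED, KG))
```

In this sketch each step narrows the search space for the next one, which mirrors the clue-passing idea in the abstract; a real system would replace the token-overlap score with learned image, text, and graph representations.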