Abstract: Visual question answering (VQA) increasingly integrates external knowledge sources to improve performance. However, because external knowledge sources may be incomplete and different forms of data are inherently mismatched, current knowledge-based visual question answering (KBVQA) techniques still struggle to integrate and use multiple heterogeneous data sources effectively. To address this issue, a novel approach centered on a multi-modal semantic graph (MSG) is proposed. The MSG unifies the representation of heterogeneous data and diverse types of knowledge. Additionally, a multi-modal semantic graph knowledge reasoning model (MSG-KRM) is introduced to perform reasoning and deep fusion of image–text information and external knowledge sources. To build the semantic graph, keywords are extracted from the image object detection information, the question text, and the external knowledge texts, and are represented as symbol nodes. Three semantic graphs (vision, question, and external knowledge text) are then constructed based on the knowledge graph; non-symbol nodes are added to connect these three independent graphs, and nodes and edges are marked with their respective types. During the inference stage, the multi-modal semantic graph and image–text information are embedded into the feature semantic graph through three embedding methods, and a type-aware graph attention module is employed for deep reasoning. The final answer prediction combines the output of the pre-trained model, the graph pooling results, and the features of the non-symbol nodes.
Experimental results on the OK-VQA dataset show that MSG-KRM outperforms existing methods in overall accuracy, achieving a score of 43.58, and improves accuracy on most question subclasses, which supports the effectiveness of the proposed method.
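
The graph construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `SemanticGraph`, the function `build_msg`, the hub-style non-symbol nodes, and every type label (such as `vision_symbol` or `cross_source`) are hypothetical names chosen for illustration, and the keywords and triples are toy inputs that would in practice come from an object detector, the question text, and a knowledge source.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the three sources named in the abstract.
SOURCES = ("vision", "question", "knowledge")


@dataclass
class SemanticGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: set = field(default_factory=set)    # (source id, target id, edge type)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, src, dst, edge_type):
        self.edges.add((src, dst, edge_type))


def build_msg(keywords_by_source, kg_triples):
    """Build a toy typed graph from per-source keywords and (head, relation, tail) triples."""
    graph = SemanticGraph()
    for source in SOURCES:
        keywords = keywords_by_source.get(source, [])
        # Assumption: one non-symbol "hub" node per source graph.
        hub = f"<{source}>"
        graph.add_node(hub, f"{source}_hub")
        for kw in keywords:
            symbol = f"{source}:{kw}"
            graph.add_node(symbol, f"{source}_symbol")
            graph.add_edge(hub, symbol, f"{source}_contains")
        # Link symbol nodes of the same source when a triple relates them.
        present = set(keywords)
        for head, relation, tail in kg_triples:
            if head in present and tail in present:
                graph.add_edge(f"{source}:{head}", f"{source}:{tail}",
                               f"{source}_{relation}")
    # Assumption: the hubs are linked pairwise so the three graphs become one.
    hubs = [f"<{s}>" for s in SOURCES]
    for i, first in enumerate(hubs):
        for second in hubs[i + 1:]:
            graph.add_edge(first, second, "cross_source")
    return graph


# Toy inputs (made up for illustration only).
keywords = {
    "vision": ["dog", "frisbee"],
    "question": ["dog", "catch"],
    "knowledge": ["frisbee", "toy"],
}
triples = [
    ("dog", "near", "frisbee"),
    ("frisbee", "is_a", "toy"),
    ("dog", "can", "catch"),
]
msg = build_msg(keywords, triples)
for src, dst, edge_type in sorted(msg.edges):
    print(f"{src} -[{edge_type}]-> {dst}")
```

Keeping an explicit type on every node and edge is what would let a downstream type-aware attention module treat, for example, a vision-to-vision edge differently from a cross-source edge. How the paper actually encodes those types is not stated in the abstract.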

Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
