Abstract: Visual question answering (VQA) increasingly integrates external knowledge sources to improve performance. However, because external knowledge sources may be incomplete and different forms of data are inherently mismatched, current knowledge-based visual question answering (KBVQA) techniques still struggle to integrate and use multiple heterogeneous data sources effectively. To address this issue, a novel approach centered on a multi-modal semantic graph (MSG) is proposed. The MSG unifies the representation of heterogeneous data and diverse types of knowledge. Additionally, a multi-modal semantic graph knowledge reasoning model (MSG-KRM) is introduced to perform reasoning and deep fusion of image–text information and external knowledge sources. To build the semantic graph, keywords are extracted from the image object detection information, the question text, and the external knowledge texts, and are represented as symbol nodes. Three semantic graphs (vision, question, and external knowledge text) are then constructed based on the knowledge graph, and non-symbol nodes are added to connect these three independent graphs, with every node and edge marked with its respective type. During the inference stage, the multi-modal semantic graph and image–text information are embedded into the feature semantic graph through three embedding methods, and a type-aware graph attention module is employed for deep reasoning. The final answer prediction combines the output of the pre-trained model, the graph pooling results, and the features of the non-symbol nodes.
Experimental results on the OK-VQA dataset show that MSG-KRM outperforms existing methods in overall accuracy, achieving a score of 43.58, and improves accuracy on most question subclasses, demonstrating the effectiveness of the proposed method.
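
The graph construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `SemanticGraph`, the function `build_msg`, the type labels, the `related` lookup (a stand-in for knowledge-graph relations), and the choice of one non-symbol node per source are all assumptions made for illustration, since the abstract does not specify these details.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Typed graph: every node and edge carries a type label (hypothetical structure)."""
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source id, target id, edge type)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, src, dst, edge_type):
        self.edges.append((src, dst, edge_type))


def build_msg(vision_kw, question_kw, knowledge_kw, related):
    """Build one typed graph from three keyword lists.

    `related` is a set of (keyword, keyword) pairs standing in for
    knowledge-graph relations; it is an assumption of this sketch.
    """
    graph = SemanticGraph()
    sources = {"vision": vision_kw, "question": question_kw, "knowledge": knowledge_kw}

    for source, keywords in sources.items():
        # One non-symbol node per source (assumed), linked to that source's keywords.
        hub = f"<{source}>"
        graph.add_node(hub, "non_symbol")
        for kw in keywords:
            node_id = f"{source}:{kw}"
            graph.add_node(node_id, f"{source}_symbol")
            graph.add_edge(hub, node_id, f"{source}_context")
        # Edges between symbol nodes of the same source that the lookup relates.
        for a in keywords:
            for b in keywords:
                if (a, b) in related:
                    graph.add_edge(f"{source}:{a}", f"{source}:{b}", f"{source}_relation")

    # Connect the three otherwise independent subgraphs through the non-symbol nodes.
    hubs = [f"<{source}>" for source in sources]
    for i, a in enumerate(hubs):
        for b in hubs[i + 1:]:
            graph.add_edge(a, b, "cross_modal")
    return graph


msg = build_msg(
    vision_kw=["dog", "frisbee"],
    question_kw=["dog", "sport"],
    knowledge_kw=["frisbee", "sport"],
    related={("dog", "frisbee")},
)
```

In this sketch the type labels on nodes and edges are what a type-aware attention module could condition on; how MSG-KRM actually encodes and uses them is not stated in the abstract.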

Boosting Visual Question Answering with Context-aware Knowledge Aggregation

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph

Zero-Shot Visual Question Answering Using Knowledge Graph

Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering

Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering

Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering

Text-based Visual Question Answering with Knowledge Base

Visual Question Answering reasoning with external knowledge based on bimodal graph neural network

Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Knowledge Condensation and Reasoning for Knowledge-based VQA

K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition

Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering

Query and Attention Augmentation for Knowledge-Based Explainable Reasoning

Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models

Rethinking Data Augmentation for Robust Visual Question Answering

Learning to Supervise Knowledge Retrieval over a Tree Structure for Visual Question Answering