Abstract: Existing Visual Question Answering (VQA) models have explored various visual relationships between objects in the image to answer complex questions, which inevitably introduces irrelevant information brought by inaccurate object detection and text grounding. To address the problem, we propose a Question-Driven Graph Fusion Network (QD-GFN). It first models semantic, spatial, and implicit visual relations in images with three graph attention networks. Question information is then used to guide the aggregation of the three graphs. Finally, QD-GFN adopts an object filtering mechanism to remove question-irrelevant objects from the image. Experimental results demonstrate that QD-GFN outperforms the prior state of the art on both the VQA 2.0 and VQA-CP v2 datasets. Further analysis shows that both the novel graph aggregation method and the object filtering mechanism play a significant role in improving the model's performance.

What problem does this paper attempt to address?

In the Visual Question Answering (VQA) task, existing models explore various visual relationships between objects in an image to answer complex questions. In doing so, they inevitably introduce irrelevant information caused by inaccurate object detection and text grounding, and this irrelevant information degrades the model's final performance. To address this, the authors propose a **Question-Driven Graph Fusion Network (QD-GFN)**.

### Specific problem description:

1. **Introduction of irrelevant information**: When modeling visual relationships in the image, existing VQA models pick up information that is irrelevant to the question because object detection and text grounding are inaccurate. This information hurts model performance.
2. **Coordination of different relationship types**: Different questions may need to focus on different types of relationships (such as semantic or spatial relationships). Coordinating these relationships effectively while reducing the interference of irrelevant information is a key issue.
3. **Object filtering**: Objects that are irrelevant to the question need to be filtered out of the image to improve the accuracy and efficiency of the model.

### Solutions:

1. **Multi-graph attention network**: QD-GFN first uses three Graph Attention Networks (GATs) to model the semantic, spatial, and implicit relationships in the image.
2. **Question-guided graph fusion**: Question information guides the aggregation of the three graphs, so that the model uses the relationship information that matters for the type and content of the question.
3. **Object filtering mechanism**: An object priority coefficient is introduced to filter out objects irrelevant to the question, reducing the interference of irrelevant information.

### Experimental results:

QD-GFN outperforms the previous state-of-the-art methods on both the VQA 2.0 and VQA-CP v2 datasets. Further analysis shows that the new graph aggregation method and the object filtering mechanism both play an important role in improving the model's performance.

### Summary:

Through question-guided graph fusion and the object filtering mechanism, QD-GFN reduces the impact of irrelevant information on model performance and improves the accuracy and robustness of visual question answering.
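The three steps listed under Solutions (attention over three relation graphs, question-guided fusion, and object filtering by a priority score) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the dot-product attention, the mean-pooled graph scoring, the top-k filtering rule, and every function and variable name here (`graph_attention`, `question_driven_fusion`, `top_k`, the random adjacency matrices) are assumptions chosen only to make the data flow concrete.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_attention(nodes, adjacency):
    """Simplified attention pass: each object attends to its neighbours.

    nodes: (N, D) object features; adjacency: (N, N), non-zero where an edge exists.
    Assumption: plain scaled dot-product scores stand in for a learned GAT layer.
    """
    scores = nodes @ nodes.T / np.sqrt(nodes.shape[1])
    scores = np.where(adjacency > 0, scores, -1e9)  # mask out non-edges
    return softmax(scores, axis=1) @ nodes


def question_driven_fusion(graph_outputs, question, top_k):
    """Fuse per-graph object features under question guidance, then filter objects.

    graph_outputs: (G, N, D), one slice per relation graph; question: (D,).
    Assumption: graphs are weighted by how well their mean-pooled features
    match the question; objects are kept by a question-conditioned priority.
    """
    pooled = graph_outputs.mean(axis=1)                          # (G, D)
    graph_weights = softmax(pooled @ question)                   # (G,)
    fused = np.tensordot(graph_weights, graph_outputs, axes=1)   # (N, D)

    priority = softmax(fused @ question)                         # (N,)
    keep = np.sort(np.argsort(priority)[-top_k:])                # indices of kept objects
    return fused[keep], graph_weights, keep


def random_graph(rng, n, p):
    """Hypothetical relation graph: random edges plus self-loops."""
    adj = (rng.random((n, n)) < p).astype(float)
    np.fill_diagonal(adj, 1.0)
    return adj


rng = np.random.default_rng(0)
num_objects, dim = 5, 8
objects = rng.normal(size=(num_objects, dim))   # stand-in for detected object features
question = rng.normal(size=dim)                 # stand-in for the encoded question

semantic = random_graph(rng, num_objects, 0.4)
spatial = random_graph(rng, num_objects, 0.4)
implicit = np.ones((num_objects, num_objects))  # assumed fully connected

outputs = np.stack([graph_attention(objects, a) for a in (semantic, spatial, implicit)])
filtered, weights, kept = question_driven_fusion(outputs, question, top_k=3)
print(filtered.shape, weights.round(3), kept)
```

In this sketch the question vector enters twice: once to weight the three graphs against each other, and once to rank objects so that only the `top_k` highest-priority ones are passed on. A trained model would replace the fixed dot products with learned projections.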

Question-Driven Graph Fusion Network For Visual Question Answering

Object-difference Drived Graph Convolutional Networks for Visual Question Answering

Information Fusion in Visual Question Answering: A Survey

Graph-enhanced visual representations and question-guided dual attention for visual question answering

Co-attention graph convolutional network for visual question answering

Modular dual-stream visual fusion network for visual question answering

Question-relationship guided graph attention network for visual question answer

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Joint Learning of Object Graph and Relation Graph for Visual Question Answering

Graph Reasoning Networks for Visual Question Answering

Multi-Modality Global Fusion Attention Network for Visual Question Answering

Question-guided Feature Pyramid Network for Medical Visual Question Answering

Relation-Aware Graph Attention Network for Visual Question Answering

Scene Graph Refinement Network for Visual Question Answering

Video Question Answering Via Grounded Cross-Attention Network Learning.

Multi-source Multi-level Attention Networks for Visual Question Answering

A focus fusion attention mechanism integrated with image captions for knowledge graph-based visual question answering

Visual Question Answering reasoning with external knowledge based on bimodal graph neural network

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Dual-feature collaborative relation-attention networks for visual question answering