Question-Driven Graph Fusion Network For Visual Question Answering

Yuxi Qian, Yuncong Hu, Ruonan Wang, Fangxiang Feng, Xiaojie Wang
DOI: https://doi.org/10.48550/arXiv.2204.00975
2022-04-03
Abstract: Existing Visual Question Answering (VQA) models have explored various visual relationships between objects in the image to answer complex questions, which inevitably introduces irrelevant information brought by inaccurate object detection and text grounding. To address the problem, we propose a Question-Driven Graph Fusion Network (QD-GFN). It first models semantic, spatial, and implicit visual relations in images by three graph attention networks, then question information is utilized to guide the aggregation process of the three graphs, further, our QD-GFN adopts an object filtering mechanism to remove question-irrelevant objects contained in the image. Experiment results demonstrate that our QD-GFN outperforms the prior state-of-the-art on both VQA 2.0 and VQA-CP v2 datasets. Further analysis shows that both the novel graph aggregation method and object filtering mechanism play a significant role in improving the performance of the model.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses a weakness of existing Visual Question Answering (VQA) models: when they explore the visual relationships between objects in an image to answer complex questions, inaccurate object detection and text grounding inevitably introduce irrelevant information, which degrades the model's final performance. To address this, the authors propose a **Question-Driven Graph Fusion Network (QD-GFN)**.

### Specific problem description:

1. **Introduction of irrelevant information**: When modeling visual relationships in the image, existing VQA models pick up information irrelevant to the question because object detection and text grounding are inaccurate, and this information hurts model performance.
2. **Coordination of different relationship types**: Different questions may need to focus on different types of relationships (such as semantic or spatial relationships). Coordinating these relationships effectively while limiting interference from irrelevant information is a key issue.
3. **Object filtering**: Objects irrelevant to the question need to be filtered out of the image to improve the accuracy and efficiency of the model.

### Solutions:

1. **Multi-graph attention network**: QD-GFN first uses three Graph Attention Networks (GATs) to model the semantic, spatial, and implicit relationships in the image.
2. **Question-guided graph fusion**: Question information guides the aggregation of the three graphs, so that the model weights the relationship types according to the type and content of the question.
3. **Object filtering mechanism**: An object priority coefficient is introduced to filter out objects irrelevant to the question, reducing interference from irrelevant information.
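
The three solution steps can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the softmax gate over the three graphs, the sigmoid priority score, the top-k cutoff, and every name used (`fuse_graphs`, `filter_objects`, `w_gate`, `w_q`, `keep`) are assumptions introduced here. Random per-object features stand in for the outputs of the three graph attention networks.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_graphs(question, graph_feats, w_gate):
    """Question-guided fusion (assumed form: a softmax gate over graphs).

    question:    (d_q,)          question embedding
    graph_feats: (3, n_obj, d_v) per-object features from the semantic,
                                 spatial, and implicit graphs
    w_gate:      (d_q, 3)        hypothetical learned gate weights
    """
    gate = softmax(question @ w_gate)                # (3,), sums to 1
    fused = np.tensordot(gate, graph_feats, axes=1)  # (n_obj, d_v)
    return fused, gate


def filter_objects(question, fused, w_q, keep):
    """Object filtering (assumed form: sigmoid priority plus top-k mask).

    w_q:  (d_q, d_v) hypothetical projection of the question into object space
    keep: number of objects to retain
    """
    q_proj = question @ w_q                               # (d_v,)
    priority = 1.0 / (1.0 + np.exp(-(fused @ q_proj)))    # (n_obj,)
    top = np.argsort(priority)[::-1][:keep]               # highest priority first
    mask = np.zeros_like(priority)
    mask[top] = 1.0
    # Kept objects are scaled by their priority; the rest are zeroed out.
    return fused * (priority * mask)[:, None], priority


# Toy usage with random stand-ins for learned weights and GAT outputs.
rng = np.random.default_rng(0)
d_q, d_v, n_obj = 8, 4, 5
question = rng.normal(size=d_q)
graph_feats = rng.normal(size=(3, n_obj, d_v))

fused, gate = fuse_graphs(question, graph_feats, rng.normal(size=(d_q, 3)))
filtered, priority = filter_objects(
    question, fused, rng.normal(size=(d_q, d_v)), keep=2
)
print("graph weights:", gate)
print("object priorities:", priority)
```

In this sketch the gate gives one weight per graph for the whole image; a per-object gate would be an equally plausible reading of "question-guided aggregation", and the summary does not say which one the paper uses.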
### Experimental results:

The experimental results show that QD-GFN outperforms the previous state-of-the-art methods on both the VQA 2.0 and VQA-CP v2 datasets. Further analysis shows that the new graph aggregation method and the object filtering mechanism both play an important role in improving the performance of the model.

### Summary:

Through question-guided graph fusion and the object filtering mechanism, QD-GFN effectively reduces the impact of irrelevant information on model performance and improves the accuracy and robustness of visual question answering.