Abstract: Visual relationship modeling plays an indispensable role in visual question answering (VQA). To answer complex reasoning questions about relationships between visual objects, a VQA model must fully understand the visual scene and the positional relationships within the image, and must reason accurately about how the objects relate to one another. However, most reasoning models used in current VQA tasks rely on simple attention mechanisms to model visual object relationships and overlook the rich visual object features that could support more effective modeling during learning. This work proposes a visual object Relationship Reasoning and Adaptive Fusion (RRAF) model to address these shortcomings. RRAF simultaneously models the position, appearance, and semantic features of visual objects and uses an adaptive fusion mechanism to achieve fine-grained multimodal reasoning and fusion. Specifically, we design an image encoder that models and learns the relationship between the position and appearance features of visual objects. In addition, in the co-attention module, we use semantic information from the question to focus on critical visual objects. Finally, we use an adaptive fusion mechanism to reassign weights and fuse features from different modalities to predict the answer. Experimental results show that the RRAF model outperforms current state-of-the-art methods on the VQA 2.0 and GQA datasets, especially on visual object counting questions. RRAF achieves an overall accuracy of 71.33% and 57.83% on the VQA 2.0 and GQA datasets, respectively, and extensive ablation experiments demonstrate the effectiveness of the model. Code is available at https://github.com/shenxiang-vqa/RRAF.
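
The abstract describes the adaptive fusion step only at a high level: weights are reassigned across modalities and the weighted features are fused. The following is a minimal sketch of one common way to realize that idea (softmax-gated weighted sum), not the RRAF implementation. The function names, the `gate_weights` scoring vectors, and the toy feature values are all hypothetical and chosen for illustration; the actual model in the linked repository may differ substantially.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(features, gate_weights):
    """Fuse same-length feature vectors with input-dependent weights.

    features:     dict name -> feature vector (list of floats, equal lengths)
    gate_weights: dict name -> hypothetical learned scoring vector, same length
    Returns (fused_vector, weights_by_name).
    """
    names = sorted(features)
    # One scalar relevance score per modality: dot(gate, feature).
    scores = [
        sum(g * x for g, x in zip(gate_weights[n], features[n])) for n in names
    ]
    # Softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    dim = len(features[names[0]])
    # Weighted sum of the modality features, dimension by dimension.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy inputs, made up for illustration only.
features = {
    "appearance": [1.0, 0.0],
    "position": [0.0, 1.0],
    "semantic": [1.0, 1.0],
}
# Assumption: all-zero gates, so every modality receives the same weight.
gate_weights = {name: [0.0, 0.0] for name in features}
fused, weights = adaptive_fusion(features, gate_weights)
```

In this sketch the weights depend on the input features themselves, so a modality whose gate scores it highly for a given example dominates the fused vector, while the softmax keeps the weights normalized.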

A Symbolic-Neural Reasoning Model for Visual Question Answering

Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering

Perceptual Visual Reasoning with Knowledge Propagation

An effective spatial relational reasoning networks for visual question answering

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Confidence-based Interactable Neural-Symbolic Visual Question Answering

Question-Guided Semantic Dual-Graph Visual Reasoning with Novel Answers

Joint Answering and Explanation for Visual Commonsense Reasoning

Convincing Rationales for Visual Question Answering Reasoning

VQA-LOL: Visual Question Answering under the Lens of Logic

Explicit Knowledge-based Reasoning for Visual Question Answering

Find The Gap: Knowledge Base Reasoning For Visual Question Answering

Interpretable Visual Question Answering via Reasoning Supervision

Toward Accurate Visual Reasoning with Dual-Path Neural Module Networks

Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering

Relational reasoning and adaptive fusion for visual question answering

Graph Reasoning Networks for Visual Question Answering

Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering