Abstract: Visual Question Answering (VQA) has advanced significantly in recent years, driven by the application of deep learning to vision-language research. Most current VQA models focus on fusing visual and textual features, but they should also model the relationships between image regions and use the question to highlight the features that matter. This study proposes a method that enhances the features of neighboring image regions and learns question-aware visual representations. First, we construct a region graph that represents the spatial relationships between objects in the image. A graph convolutional network (GCN) then propagates information across neighboring regions, enriching each region's representation with contextual information. To capture long-range dependencies, the graph is augmented using random walk with restart (RWR), which enables multi-hop reasoning across distant regions. A question-aware dual attention mechanism then refines the region features at both the region level and the feature level, so that the model emphasizes the regions most critical for answering the question. The enhanced region representations are combined with the encoded question to predict an answer. Combining GCNs with random walks on the graph captures the contextual information needed to focus visual attention selectively. Extensive experiments on the VQA 1.0 and VQA 2.0 benchmark datasets show that exploiting regional dependencies and question guidance yields state-of-the-art performance, with significant improvements over existing methods.
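
The pipeline described in the abstract (region graph, RWR augmentation, GCN propagation, question-aware dual attention) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the k-nearest-centre rule for building the graph, the closed-form RWR, the single ReLU layer, the dot-product region attention, the sigmoid feature gate, and all function names and parameters (`region_graph`, `rwr`, `gcn_layer`, `dual_attention`, `k`, `restart`).

```python
import numpy as np

def region_graph(boxes, k=2):
    """Assumed rule: link each region to its k nearest regions by box-centre distance.
    boxes has shape (n, 4) with columns x1, y1, x2, y2."""
    centres = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    dist = np.linalg.norm(centres[:, None] - centres[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a region is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros_like(dist)
    adj[np.repeat(np.arange(len(boxes)), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)             # symmetrise

def rwr(adj, restart=0.3):
    """Closed-form random walk with restart: R = c (I - (1 - c) W)^-1,
    where W is the column-normalised adjacency. Column j of R is the
    stationary distribution of a walker that restarts at node j."""
    W = adj / np.maximum(adj.sum(axis=0, keepdims=True), 1e-12)
    n = adj.shape[0]
    return restart * np.linalg.inv(np.eye(n) - (1 - restart) * W)

def gcn_layer(prop, feats, weight):
    """One propagation step: ReLU(prop @ feats @ weight)."""
    return np.maximum(prop @ feats @ weight, 0.0)

def dual_attention(feats, question):
    """Region-level softmax weights plus a feature-level sigmoid gate,
    both driven by the question vector (same width as feats, by assumption)."""
    logits = feats @ question
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                      # one weight per region
    gate = 1.0 / (1.0 + np.exp(-question))    # one gate per feature channel
    pooled = (alpha[:, None] * feats * gate).sum(axis=0)
    return alpha, pooled

def encode(boxes, feats, weight, question, k=2, restart=0.3):
    """Chain the four steps; rows of R.T sum to 1, so each region
    aggregates a weighted average over multi-hop neighbours."""
    R = rwr(region_graph(boxes, k), restart)
    return dual_attention(gcn_layer(R.T, feats, weight), question)
```

In this sketch, the RWR matrix gives non-zero weight to pairs of regions that share no direct edge, which is one simple way to obtain the multi-hop behaviour the abstract mentions. The pooled vector returned by `encode` would then be fused with the question encoding by a classifier, which is omitted here.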

Graph-enhanced visual representations and question-guided dual attention for visual question answering
