Abstract: Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, into vanilla self-attention as an effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Regularizing self-attention with graph information in this way significantly improves inference ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on the GQA, VQAv2, and MultiModalQA datasets.

What problem does this paper attempt to address?

The paper addresses how to integrate structured information from different modalities (such as text and visual data) in multimodal question answering and how to reason over that information. Existing Transformer models have been successful in vision and language tasks, but they usually learn knowledge implicitly from large amounts of data and cannot use structured input directly. Structured learning methods based on graph neural networks (GNNs) can integrate prior information, but they struggle to match the performance of Transformer models. The paper therefore aims to combine the advantages of both and proposes a new Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities.

Specifically, the paper proposes a plug-and-play, graph-involved quasi-attention mechanism that injects multimodal graph information, obtained from text and visual data, into the standard self-attention mechanism as a prior. The method constructs a text graph, a dense region graph, and a semantic graph, generates adjacency matrices from them, and combines these matrices with the input visual and language features for downstream reasoning. Self-attention is thereby regularized by graph information, which, according to the authors, significantly improves reasoning ability and helps align the features of different modalities.

The main contributions of the paper are:

1. A new Multimodal Graph Transformer learning framework that combines multimodal graphs learned from unstructured data with Transformer models.
2. A modular, plug-and-play, graph-involved quasi-attention mechanism with a trainable bias term that guides the information flow during training.
3. An empirical evaluation of the proposed method on the GQA, VQAv2, and MultiModalQA tasks.

In conclusion, the paper responds to the challenges of multimodal question answering with a method intended to improve both the reasoning ability of the model and the alignment of cross-modal information.
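
The idea of regularizing self-attention with an adjacency matrix can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary above does not give the exact formulation, so the function name `graph_biased_attention`, the parameters `edge_bias` and `non_edge_bias`, and the choice to add a per-edge bias to the attention scores before the softmax are all assumptions made for illustration. In a trained model the two bias values would be learnable parameters; here they are fixed constants.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_biased_attention(x, adjacency, w_q, w_k, w_v, edge_bias, non_edge_bias):
    """Single-head self-attention with an additive graph prior (illustrative only).

    x:         (n, d) token or region features
    adjacency: (n, n) 0/1 matrix from a text, region, or semantic graph
    edge_bias, non_edge_bias: scalars standing in for a trainable bias term
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Assumption: the graph enters as a bias on the attention logits.
    graph_prior = np.where(adjacency > 0, edge_bias, non_edge_bias)
    weights = softmax(scores + graph_prior, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Hypothetical chain graph with self-loops: node i is linked to i-1, i, i+1.
adjacency = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
])

# A very negative non-edge bias acts as a hard mask; a mild value such as
# -1.0 would give a soft prior that only discourages non-edge attention.
out, weights = graph_biased_attention(
    x, adjacency, w_q, w_k, w_v, edge_bias=0.0, non_edge_bias=-1e9
)
```

With the hard-mask setting used here, each node attends only to its graph neighbours. Several graphs could be combined in the same way by summing their bias matrices before the softmax, which is again a design assumption rather than something stated in the summary.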

Multimodal Graph Transformer for Multimodal Question Answering

DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks

Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Discovering Multimodal Hierarchical Structures with Graph Neural Networks for Multi-modal and Multi-hop Question Answering

So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering

Multimodal Graph Reasoning and Fusion for Video Question Answering

Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Graph Reasoning Transformers for Knowledge-Aware Question Answering

When Graph Data Meets Multimodal: A New Paradigm for Graph Understanding and Reasoning

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Co-attention graph convolutional network for visual question answering

Multimodal Cross-guided Attention Networks for Visual Question Answering

Spatially Aware Multimodal Transformers for TextVQA

Multimodal Dialogue Generation Based on Transformer and Collaborative Attention

A multi-scale self-supervised hypergraph contrastive learning framework for video question answering

Question guided multimodal receptive field reasoning network for fact-based visual question answering

Positional Attention Guided Transformer-Like Architecture for Visual Question Answering

Modality-Independent Graph Neural Networks with Global Transformers for Multimodal Recommendation