Multimodal Graph Transformer for Multimodal Question Answering

Xuehai He, Xin Eric Wang
2023-05-01
Abstract: Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require performing reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, into the vanilla self-attention as an effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inferring ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to integrate structured information from different modalities (such as text and visual data) in multimodal question answering and how to reason over that information. Existing Transformer models have succeeded in vision and language tasks, but they usually learn knowledge implicitly from large amounts of data and cannot directly use structured input data. Structured learning methods based on graph neural networks (GNNs) can integrate prior information, but they struggle to match the performance of Transformer models. The paper therefore aims to combine the advantages of both and proposes a Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities.
Specifically, the paper proposes a plug-and-play, graph-involved quasi-attention mechanism that incorporates multimodal graph information, obtained from text and visual data, into vanilla self-attention as an effective prior. It constructs a text graph, a dense region graph, and a semantic graph to generate adjacency matrices, and then combines these matrices with the input vision and language features to perform downstream reasoning. Regularizing self-attention with graph information in this way significantly improves reasoning ability and helps align features from different modalities.
The main contributions of the paper include:
1. Proposing a new Multimodal Graph Transformer learning framework that combines multimodal graphs learned from unstructured data with Transformer models.
2. Introducing a modular, plug-and-play, graph-involved quasi-attention mechanism that contains a trainable bias term to guide the information flow during training.
3. Verifying the effectiveness of the proposed method through empirical experiments on the GQA, VQAv2, and MultiModalQA tasks.
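The idea of using an adjacency matrix as a prior on self-attention can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary does not say how the adjacency matrices and the trainable bias are combined with the attention scores, so the sketch assumes an additive bias followed by a hard mask on non-edges. The function name `graph_biased_attention`, its parameters, and the toy chain graph are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_biased_attention(X, Wq, Wk, Wv, adjacency, bias, neg=-1e9):
    """Single-head self-attention restricted by a graph (illustrative only).

    Assumptions (not taken from the paper):
      * `bias` is added to the raw scores; in a real model it would be trainable.
      * positions with adjacency == 0 are masked out before the softmax.
    Every row of `adjacency` needs at least one nonzero entry (e.g. a self-loop).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # vanilla scaled dot-product scores
    scores = scores + bias                          # assumed additive bias term
    scores = np.where(adjacency > 0, scores, neg)   # suppress pairs that are not graph edges
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy usage: 4 tokens with 8-dimensional features on a chain graph with self-loops.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
adjacency = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
bias = np.zeros((n, n))  # stands in for a learned bias
out, weights = graph_biased_attention(X, Wq, Wk, Wv, adjacency, bias)
```

With the hard mask, each token attends only to its graph neighbours and itself. A softer variant would drop the mask and let the bias alone raise or lower scores, which stays closer to the idea of a prior and further from a constraint.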
In conclusion, the paper responds to the challenges of multimodal question answering by proposing a method that aims to improve the model's reasoning ability and the alignment of cross-modal information.