Multi-Head Attention Fusion Network for Visual Question Answering

Haiyang Zhang, Ruoyu Li, Liang Liu
DOI: https://doi.org/10.1109/icme52920.2022.9859639
2022-01-01
Abstract: Visual Question Answering (VQA) is the challenging task of answering questions about an image. Most approaches use attention networks to focus on the crucial objects in the image and the key words in the question. However, the attention distributions of these prior approaches tend to locate similar regions, which limits their ability to identify important entities. To address this issue, we propose a multi-head attention fusion network (MHAFN), which achieves hierarchical multimodal fusion with multiple branches to capture fine-grained and intricate relationships at multiple levels: words, regions, and the interactions between them. Furthermore, it captures distinct attention distributions in order to attend to multiple different visual and textual components that are vital for inferring the answer. Extensive experiments on the VQA-v2 benchmark dataset demonstrate that MHAFN significantly outperforms previous methods.
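As a rough illustration of the idea that separate attention heads can produce distinct attention distributions over image regions, the following is a minimal sketch of generic multi-head scaled dot-product attention. It is not the authors' MHAFN architecture: the function names (`attend`, `split_heads`, `multi_head_attend`), the feature dimensions, and the random inputs are all assumptions made for illustration, and the learned projection matrices of a real model are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def split_heads(vec, num_heads):
    """Split a feature vector into num_heads equal-sized slices."""
    size = len(vec) // num_heads
    return [vec[h * size:(h + 1) * size] for h in range(num_heads)]


def multi_head_attend(question_vec, region_vecs, num_heads):
    """Attend from a question vector to image-region vectors, one head per slice.

    Hypothetical simplification: a trained model would apply learned
    query/key/value projections before splitting; here the raw features
    are sliced directly, and region slices serve as both keys and values.
    """
    if len(question_vec) % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    q_heads = split_heads(question_vec, num_heads)
    r_heads = [split_heads(r, num_heads) for r in region_vecs]
    fused, all_weights = [], []
    for h in range(num_heads):
        keys = [r[h] for r in r_heads]
        context, weights = attend(q_heads[h], keys, keys)
        fused.extend(context)        # concatenate per-head contexts
        all_weights.append(weights)  # one distribution over regions per head
    return fused, all_weights


# Toy inputs: an 8-dimensional question feature and 6 region features.
random.seed(0)
question = [random.gauss(0, 1) for _ in range(8)]
regions = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
fused, weights = multi_head_attend(question, regions, num_heads=2)
```

Each entry of `weights` is a separate probability distribution over the six regions, so different heads are free to emphasize different regions; `fused` concatenates the per-head context vectors back to the original feature size.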