Question-relationship guided graph attention network for visual question answer

Rui Liu, Liansheng Zhuang, Zhou Yu, Zhihao Jiang, Tian Bai
DOI: https://doi.org/10.1007/s00530-020-00745-7
IF: 3.9
2021-03-15
Multimedia Systems
Abstract: A high-level understanding of the surrounding context of an image is indispensable for visual question answering (VQA) when faced with difficult questions. Previous studies address this issue by modeling object-level visual content and transforming the internal relationships into a graph or tree. However, this still leaves a gap between the modalities of language and vision, and it ignores the abstract-level content of images and the meaning of the relationships between them. This paper proposes a question-relationship guided graph attention network (QRGAT) that learns a new representation of the visual features of an image under the guidance of the question and the explicit, internal relationships of objects. Specifically, to narrow the gap between modalities, visual regions are represented as the combination of their attributes and visual features. Meanwhile, semantic relationships are transformed into the language modality and used to form updated visual features. Three graph encoders with diverse relationships are used to capture high-level features of images. Experimental results on the VQA 2.0 dataset show that the proposed QRGAT outperforms other interpretable visual context structures.
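The abstract gives no equations, so the following is only a minimal sketch of what one question- and relationship-guided attention step over a region graph could look like. It is not the authors' QRGAT formulation. The function name, the projection matrices `W_v`, `W_r`, `W_q`, the additive message, the scaled dot-product scoring, and all dimensions are assumptions chosen for illustration. The two ideas taken from the abstract are that each region is the combination of its attribute and visual features, and that relationships enter as embeddings in the language modality.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_relation_attention(regions, relations, question, adjacency,
                                W_v, W_r, W_q):
    """One hypothetical question-guided attention step over a region graph.

    regions:   (N, d_v)     region features (visual + attribute, concatenated)
    relations: (N, N, d_r)  embedding of the relationship on edge i <- j
    question:  (d_q,)       question embedding
    adjacency: (N, N) bool  True where edge i <- j exists (self-loops included)
    """
    h = regions @ W_v                 # (N, d) projected region features
    r = relations @ W_r               # (N, N, d) projected relation embeddings
    q = question @ W_q                # (d,) projected question

    # Assumed message from j to i: neighbor feature plus relation embedding.
    messages = h[None, :, :] + r      # (N, N, d), indexed [i, j]

    # Assumed question guidance: gate each node's query by the question.
    query = h * q                     # (N, d)

    scores = np.einsum("id,ijd->ij", query, messages) / np.sqrt(h.shape[1])
    scores = np.where(adjacency, scores, -1e9)   # mask non-edges
    alpha = softmax(scores, axis=1)              # attention over neighbors j

    updated = np.einsum("ij,ijd->id", alpha, messages)  # (N, d)
    return updated, alpha

# Toy inputs with made-up sizes; random values stand in for real features.
rng = np.random.default_rng(0)
N, d_vis, d_attr, d_rel, d_q, d = 4, 6, 4, 5, 7, 8

visual = rng.normal(size=(N, d_vis))
attributes = rng.normal(size=(N, d_attr))
regions = np.concatenate([visual, attributes], axis=1)   # (N, d_vis + d_attr)

relations = rng.normal(size=(N, N, d_rel))
question = rng.normal(size=(d_q,))
adjacency = (rng.random((N, N)) < 0.5) | np.eye(N, dtype=bool)

W_v = rng.normal(size=(d_vis + d_attr, d))
W_r = rng.normal(size=(d_rel, d))
W_q = rng.normal(size=(d_q, d))

updated, alpha = question_relation_attention(
    regions, relations, question, adjacency, W_v, W_r, W_q)
```

In this sketch, the three graph encoders mentioned in the abstract would correspond to calling the function three times with different `adjacency` and `relations` inputs and then merging the outputs. How the paper actually combines them is not stated in the abstract.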
computer science, information systems, theory & methods