Visual-Textual Semantic Alignment Network for Visual Question Answering

Weidong Tian, Yuzheng Zhang, Bin He, Junjun Zhu, Zhongqiu Zhao
DOI: https://doi.org/10.1007/978-3-030-86383-8_21
2021-01-01
Abstract: The VQA task requires a deep understanding of visual and textual content, as well as access to key information, to answer questions well. Most current works use only the image and the question as network input; the image features are over-sampled and the text features are under-sampled, resulting in insufficient alignment between image regions and question words. In this paper, we propose a Visual-Textual Semantic Alignment Network (VTSAN). Our network acquires tags for visual semantics from a target detector and takes the Image-Tag-Question <I, T, Q> triad as input. The tags serve as an intermediate medium between the key regions of the image and the key words of the question, and can greatly enrich the text features, thereby significantly improving visual-textual semantic alignment. We demonstrate the effectiveness of the proposed network on the standard VQAv2 and VQA-CPv2 benchmarks. The experimental results show that the proposed network significantly outperforms the baseline, especially on counting questions.
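To make the <I, T, Q> triad concrete, the following is a minimal sketch of how such an input might be assembled. It is not the authors' implementation: the `Region`, `Triad`, `build_triad`, and `shared_words` names are hypothetical, the detector output is hand-written rather than produced by a real model, and treating detector class labels as the tags is an assumption based only on the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    # Hypothetical detector output: a bounding box plus a predicted class name.
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    label: str  # assumed here to play the role of the "tag"


@dataclass
class Triad:
    regions: List[Region]  # I: image regions
    tags: List[str]        # T: tags taken from the detector labels
    question: List[str]    # Q: lower-cased question tokens


def build_triad(regions: List[Region], question: str) -> Triad:
    """Assemble an <I, T, Q> triad from detector regions and a question string."""
    tags = [r.label for r in regions]
    tokens = question.lower().rstrip("?").split()
    return Triad(regions=regions, tags=tags, question=tokens)


def shared_words(triad: Triad) -> List[str]:
    """Question words that also occur as tags (naive plural stripping only)."""
    tag_set = set(triad.tags)
    return [w for w in triad.question if w in tag_set or w.rstrip("s") in tag_set]


# Illustrative, hand-written detections (not real model output).
regions = [
    Region((10, 20, 110, 140), "dog"),
    Region((150, 30, 260, 150), "dog"),
    Region((0, 140, 300, 200), "grass"),
]
triad = build_triad(regions, "How many dogs are on the grass?")
overlap = shared_words(triad)
```

In this toy case the tags share vocabulary with the question ("dogs", "grass"), and the repeated "dog" tag carries the count directly in text form, which gives one intuition for why tags could help on counting questions. How VTSAN actually encodes and fuses the three inputs is not specified in the abstract.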