Improving Visual Question Answering with Pre-Trained Language Modeling

Yue Wu, Huiyi Gao, Lei Chen
DOI: https://doi.org/10.1117/12.2574575
2020-01-01
Abstract: Visual question answering (VQA) is an important task in artificial intelligence research. However, most studies use simple gated recurrent units (GRUs) to extract high-level question or image features, which is not sufficient to achieve better performance. In this paper, we propose two improvements to a general VQA model based on the dynamic memory network (DMN). First, we initialize the question module of our model with a pre-trained language model. Second, we replace the GRU in the input fusion layer of the input module with a new module. Experimental results demonstrate the effectiveness of our method, with an improvement of 1.52% over the baseline on the Visual Question Answering V2 dataset.