Multi-modal Feature Fusion Based on Variational Autoencoder for Visual Question Answering

Liqing Chen, Yifan Zhuo, Yingjie Wu, Yilei Wang, Xianghan Zheng
DOI: https://doi.org/10.1007/978-3-030-31723-2_56
2019-01-01
Abstract: Visual Question Answering (VQA) tasks require a model to answer questions posed about given images correctly. The task has attracted wide attention since it was introduced. VQA consists of four steps: image feature extraction, question text feature extraction, multi-modal feature fusion, and answer reasoning. Existing models use an outer product calculation for multi-modal feature fusion, which leads to excessive model parameters, high training overhead, and slow convergence. To avoid these problems, we apply the Variational Autoencoder (VAE) method to calculate the probability distribution of the hidden variables of the image and the question text. Furthermore, we design a question feature hierarchy method based on the traditional attention mechanism model and VAE. The objective is to explore deep correlation features between questions and images and thereby improve the accuracy of VQA tasks.
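
To make the parameter argument concrete, the following is a minimal sketch, not the authors' implementation. It compares the weight count of a linear layer applied to a flattened outer product with that of a hypothetical latent-variable fusion, and shows the standard VAE reparameterization and KL term. The layer layout, the function names, and all dimensions (2048, 512, 3000) are illustrative assumptions; none of them come from the abstract.

```python
import math
import random


def outer_product_fusion_params(d_img, d_txt, d_out):
    """Weights of a linear layer applied to the flattened outer product
    of an image vector (d_img) and a question vector (d_txt)."""
    return d_img * d_txt * d_out


def latent_fusion_params(d_img, d_txt, d_latent, d_out):
    """Weights of a hypothetical alternative (assumed layout, biases ignored):
    each modality has a mean head and a log-variance head of size d_latent,
    and a classifier reads the two concatenated latent codes."""
    encoders = 2 * d_latent * (d_img + d_txt)
    classifier = 2 * d_latent * d_out
    return encoders + classifier


def reparameterize(mu, log_var, rng):
    """Standard VAE reparameterization: z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    # Illustrative dimensions only (assumptions, not taken from the paper).
    print("outer product:", outer_product_fusion_params(2048, 2048, 3000))
    print("latent fusion:", latent_fusion_params(2048, 2048, 512, 3000))

    rng = random.Random(0)
    z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)
    print("sampled latent size:", len(z))
    print("KL at the prior:", kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```

Under these assumed sizes the outer-product layer needs several orders of magnitude more weights than the latent-variable layout, which is the kind of overhead the abstract attributes to outer-product fusion.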