Multi-Question Learning for Visual Question Answering

Chenyi Lei, Lei Wu, Dong Liu, Zhao Li, Guoxin Wang, Haihong Tang, Houqiang Li
DOI: https://doi.org/10.1609/aaai.v34i07.6794
2020-01-01
Proceedings of the AAAI Conference on Artificial Intelligence
Abstract: Visual Question Answering (VQA) poses a great challenge for the computer vision and natural language processing communities. Most existing approaches consider video-question pairs individually during training. However, we observe that a VQA task usually has multiple questions (sequentially generated or not) for the target video, and that these questions have abundant semantic relations among themselves. To explore these relations, we propose a new paradigm for VQA termed Multi-Question Learning (MQL). Inspired by multi-task learning, MQL learns jointly from multiple questions and their corresponding answers for a target video sequence. The learned representations of video-question pairs are then more general and transfer better to new questions. We further propose an effective VQA framework and design a training procedure for MQL, in which a specifically designed attention network models the relation between the input video and its corresponding questions, enabling multiple video-question pairs to be co-trained. Experimental results on public datasets show that the proposed MQL-VQA framework performs favorably compared with state-of-the-art methods.