Multitask Learning for Visual Question Answering
Jie Ma,Jun Liu,Qika Lin,Bei Wu,Yaxian Wang,Yang You
DOI: https://doi.org/10.1109/tnnls.2021.3105284
IF: 14.255
2021-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract:Visual question answering (VQA) is a task that machines should provide an accurate natural language answer given an image and a question about the image. Many studies have found that the current VQA methods are heavily driven by the surface correlation or statistical bias in the training data, and lack sufficient image grounding. To address this issue, we devise a novel end-to-end architecture that uses multitask learning to promote more sufficient image grounding and learn effective multimodality representations. The tasks consist of VQA and our proposed image cloze (IC) task requires machines to fill in the blanks accurately given an image and a textual description of the image. To ensure our model performs sufficient image grounding as much as possible, we propose a novel word-masking algorithm to develop the multimodal IC task based on the part-of-speech of words. Our model predicts the VQA answer and fills in the blanks after the multimodality representation learning that is shared by the two tasks. Experimental results show that our model achieves almost the equivalent, state-of-the-art, second-best performance on the VQA v2.0, VQA-changing priors (CP) v2, and grounded question answering (GQA) datasets, respectively, with fewer parameters and without additional data compared with baselines.
computer science, artificial intelligence, theory & methods,engineering, electrical & electronic, hardware & architecture