Zero-Shot Transfer VQA Dataset

Yuanpeng Li, Yi Yang, Jianyu Wang, Wei Xu
DOI: https://doi.org/10.48550/arXiv.1811.00692
2018-11-02
Abstract: Acquiring a large vocabulary is an important aspect of human intelligence. One common way for humans to build vocabulary is to learn words during reading or listening, and then use them in writing or speaking. This ability to transfer from input to output is natural for humans, but it is difficult for machines. Humans spontaneously perform this knowledge transfer in complicated multimodal tasks, such as Visual Question Answering (VQA). In order to approach human-level Artificial Intelligence, we hope to equip machines with such ability. Therefore, to accelerate this research, we propose a new zero-shot transfer VQA (ZST-VQA) dataset by reorganizing the existing VQA v1.0 dataset so that during training, some words appear only in one module (i.e. questions) but not in the other (i.e. answers). In this setting, an intelligent model should understand and learn the concepts from one module (i.e. questions), and at test time, transfer them to the other (i.e. predict the concepts as answers). We conduct evaluation on this new dataset using three existing state-of-the-art VQA neural models. Experimental results show a significant drop in performance on this dataset, indicating that existing methods do not address the zero-shot transfer problem. Besides, our analysis finds that this may be caused by the implicit bias learned during training.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses zero-shot transfer in visual question answering (VQA). Specifically, the authors ask whether a model that encounters certain words in only one role during training (for example, in questions) can use those words in the other role (for example, as answers) at test time. This transfer is natural for humans but very difficult for machines. Existing VQA models handle it poorly, which suggests that they lack genuine understanding and transfer ability.

To evaluate existing models on this task, the authors propose a new dataset, the zero-shot transfer VQA (ZST-VQA) dataset. It is built by reorganizing the existing VQA v1.0 dataset so that, during training, some words appear only in questions and not in answers, and vice versa. The design tests whether a model can transfer a word it has learned in one role to a role in which it has never seen that word.

Experiments show that three existing state-of-the-art VQA models perform significantly worse on ZST-VQA. The drop is most severe in the zero-shot answer (ZSA) task, where test accuracy falls to 0%, indicating that current models cannot perform zero-shot transfer. Further analysis suggests that implicit bias learned during training may be one cause. These findings underline the importance of zero-shot transfer in VQA research and point to directions for future work, such as improving the network architecture, loss function, or regularization method.
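
The reorganization described above can be illustrated with a minimal Python sketch. It is not the authors' construction procedure: the record layout (`question` and `answer` keys), the whitespace tokenizer, the function names, and the toy data are all assumptions made for illustration. It shows only the core constraint of the zero-shot answer setting, namely that a held-out word never appears in a training answer but may still appear in training questions.

```python
from typing import Dict, List, Set, Tuple

QAPair = Dict[str, str]  # hypothetical record: {"question": ..., "answer": ...}


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace, dropping question marks."""
    return text.lower().replace("?", " ").split()


def zero_shot_answer_split(
    pairs: List[QAPair], held_out: Set[str]
) -> Tuple[List[QAPair], List[QAPair]]:
    """Keep held-out words out of every training answer.

    A pair goes to the test set when its answer contains a held-out word;
    otherwise it goes to the training set, where the word may still occur
    in the question.
    """
    train: List[QAPair] = []
    test: List[QAPair] = []
    for pair in pairs:
        if held_out & set(tokenize(pair["answer"])):
            test.append(pair)
        else:
            train.append(pair)
    return train, test


def words_seen_in_questions(pairs: List[QAPair], words: Set[str]) -> Set[str]:
    """Return the subset of `words` that occurs in at least one question."""
    seen: Set[str] = set()
    for pair in pairs:
        seen |= words & set(tokenize(pair["question"]))
    return seen


# Toy data (invented for illustration, not taken from VQA v1.0).
pairs = [
    {"question": "Is the cat on the table?", "answer": "yes"},
    {"question": "What animal is this?", "answer": "cat"},
    {"question": "What color is the cat?", "answer": "black"},
    {"question": "What is on the sofa?", "answer": "dog"},
]
held_out = {"cat"}

train, test = zero_shot_answer_split(pairs, held_out)
# Held-out words the model can still learn from training questions.
transferable = words_seen_in_questions(train, held_out)
print(len(train), len(test), transferable)
```

In this toy example the pair whose answer is "cat" moves to the test set, while "cat" still occurs in two training questions. A model whose answer vocabulary is built only from training answers cannot output "cat" at all, which is consistent with the 0% ZSA accuracy reported in the summary.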