Abstract: Acquiring a large vocabulary is an important aspect of human intelligence. One common approach for humans to populate their vocabulary is to learn words during reading or listening, and then use them in writing or speaking. This ability to transfer from input to output is natural for humans, but it is difficult for machines. Humans spontaneously perform this knowledge transfer in complicated multimodal tasks, such as Visual Question Answering (VQA). In order to approach human-level Artificial Intelligence, we hope to equip machines with such ability. Therefore, to accelerate this research, we propose a new zero-shot transfer VQA (ZST-VQA) dataset by reorganizing the existing VQA v1.0 dataset in such a way that during training, some words appear only in one module (i.e. questions) but not in the other (i.e. answers). In this setting, an intelligent model should understand and learn the concepts from one module (i.e. questions), and at test time, transfer them to the other (i.e. predict the concepts as answers). We conduct evaluation on this new dataset using three existing state-of-the-art VQA neural models. Experimental results show a significant drop in performance on this dataset, indicating existing methods do not address the zero-shot transfer problem. Besides, our analysis finds that this may be caused by the implicit bias learned during training.

What problem does this paper attempt to address?

This paper addresses zero-shot transfer in the visual question answering (VQA) task. Specifically, the authors ask whether a model that sees certain words in only one module during training (for example, in the questions) can transfer those words to the other module at test time (for example, by predicting them as answers). This ability is natural for humans but very difficult for machines. To evaluate existing VQA models in this regard, the authors propose a new dataset, the zero-shot transfer VQA (ZST-VQA) dataset. It is created by reorganizing the existing VQA v1.0 dataset so that, during training, some words appear only in the questions and not in the answers, and vice versa. This design tests whether a model can use a word in a role where it never appeared during training. The experimental results show that the performance of three existing state-of-the-art VQA models drops significantly on the ZST-VQA dataset, especially in the zero-shot answer (ZSA) task, where the test accuracy drops to 0%. This indicates that current models lack the ability to perform zero-shot transfer. Further analysis finds that the implicit bias learned during training may be one of the reasons for the performance degradation. These findings emphasize the importance of considering zero-shot transfer in VQA research and point to directions for future work, such as improving the network architecture, loss function, or regularization method to strengthen a model's transfer ability.
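The reorganization described above can be illustrated with a minimal Python sketch. It is not the authors' actual procedure: the function names, the whitespace tokenizer, the toy samples, and the choice of "zebra" as the held-out word are all assumptions made for illustration. The sketch moves every sample whose answer contains a held-out word into the test set, so that the word is seen only in training questions. It then shows why a model restricted to answers seen in training cannot score above 0% on such a test set.

```python
from typing import Dict, List, Set, Tuple

QA = Dict[str, str]  # hypothetical sample format: {"question": ..., "answer": ...}


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (an assumption, not the paper's)."""
    return text.lower().replace("?", " ").split()


def zero_shot_transfer_split(
    samples: List[QA], held_out_words: Set[str]
) -> Tuple[List[QA], List[QA]]:
    """Send samples whose answer uses a held-out word to the test set.

    Training questions may still contain the held-out words, so a model can
    see the concept as input but never as a training answer.
    """
    train, test = [], []
    for sample in samples:
        if set(tokenize(sample["answer"])) & held_out_words:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


def answer_vocabulary(samples: List[QA]) -> Set[str]:
    """Set of full answer strings that occur in the given samples."""
    return {sample["answer"].lower() for sample in samples}


# Toy data, invented for illustration only.
samples = [
    {"question": "Is the zebra eating grass?", "answer": "yes"},
    {"question": "What color is the zebra?", "answer": "black and white"},
    {"question": "What animal is this?", "answer": "zebra"},
    {"question": "What animal is this?", "answer": "dog"},
]

train, test = zero_shot_transfer_split(samples, held_out_words={"zebra"})

train_question_words = {w for s in train for w in tokenize(s["question"])}
train_answers = answer_vocabulary(train)

# Fraction of test answers a classifier over training answers could ever emit.
reachable = sum(s["answer"].lower() in train_answers for s in test) / len(test)

print("held-out word seen in training questions:", "zebra" in train_question_words)
print("held-out word seen in training answers:", "zebra" in train_answers)
print("upper bound on test accuracy:", reachable)
```

In this toy setup the upper bound is zero because the output space is fixed to the answers seen in training. That gives one plausible reading of the 0% ZSA accuracy reported above, although the sketch makes no claim about how the three evaluated models are actually built.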

Zero-Shot Transfer VQA Dataset

Simple and Effective Visual Question Answering in a Single Modality

Zero-Shot Detection with Transferable Object Proposal Mechanism

Zero-Shot Visual Question Answering Using Knowledge Graph

Overcoming the Limitations of Learning-Based VQA for Counting Questions with Zero-Shot Learning

UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models

Meta-Transfer Networks for Zero-Shot Learning

Transformer-Based Approach Via Contrastive Learning for Zero-Shot Detection

Modularized Zero-shot VQA with Pre-trained Models

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Exploring Question Decomposition for Zero-Shot VQA

Zero-shot Question Generation: Accelerate the Development of Domain-specific Dialogue System

Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning

Good Questions Help Zero-Shot Image Reasoning

Zero-shot Visual Question Answering with Language Model Feedback

Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering

Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario

ZVQAF: Zero-shot visual question answering with feedback from large language models

Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models