Abstract: Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains.
What problem does this paper attempt to address?
This paper addresses the poor generalization of Visual Question Answering (VQA) algorithms between natural image datasets and synthetic datasets. Most current VQA algorithms perform well on tasks that require understanding natural images but poorly on synthetic datasets that require complex reasoning, or vice versa. The paper argues that a good VQA algorithm should handle both types of tasks, yet few algorithms have been tested in both domains. The authors therefore compare five state-of-the-art VQA algorithms on eight datasets and find that these methods do not generalize across domains. To address this, they propose a new VQA algorithm, RAMEN, which rivals or exceeds the state of the art in both domains.
### Main contributions:
1. **Cross-domain performance evaluation**: The authors rigorously compare five state-of-the-art VQA algorithms on eight VQA datasets and find that many algorithms fail to generalize across domains.
2. **Standardized components**: To make the evaluation fairer, the authors standardize the components used by each model as much as possible, for example by using the same visual features and answer vocabulary.
3. **Generalization ability analysis**: The authors find that most VQA algorithms cannot handle both real-world images and compositional reasoning, and that they perform poorly in generalization tests, which indicates that these methods still exploit biases in the datasets.
4. **New algorithm proposal**: The authors propose a new VQA algorithm, RAMEN, which performs strongly on all evaluated datasets and achieves the best overall performance.
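The shared answer vocabulary mentioned in contribution 2 could be built along the following lines. This is only a minimal sketch: the function names `build_answer_vocab` and `encode_answer`, the frequency-based cutoff, and the `-1` marker for out-of-vocabulary answers are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def build_answer_vocab(train_answers, max_size):
    """Map the most frequent training answers to class indices.

    Hypothetical helper: the frequency cutoff is an assumption,
    not a procedure described in the summary above.
    """
    counts = Counter(train_answers)
    most_frequent = counts.most_common(max_size)
    return {answer: index for index, (answer, _) in enumerate(most_frequent)}


def encode_answer(answer, vocab):
    """Return the class index, or -1 for an answer outside the vocabulary."""
    return vocab.get(answer, -1)


# Toy answers for illustration only, not drawn from any real dataset.
toy_answers = ["yes", "no", "yes", "2", "yes", "no"]
vocab = build_answer_vocab(toy_answers, max_size=2)
encoded = [encode_answer(a, vocab) for a in ["yes", "no", "2"]]
```

Using one vocabulary for every model means that differences in accuracy cannot come from one model simply having more answer classes available than another.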
### Paper background:
- **VQA datasets**: VQA research is divided into two major camps: one focuses on datasets that require understanding natural images, and the other on synthetic datasets that test reasoning ability. Datasets such as VQAv1, VQAv2, TDIUC, CVQA, and VQACPv2 are based on natural images, while CLEVR, CLEVR-Humans, and CLEVR-CoGenT test multi-step reasoning, counting, and logical reasoning.
- **Existing algorithms**: Current VQA algorithms are mainly divided into two categories, one for natural images and the other for synthetic datasets. However, few algorithms have been fully tested in both fields.
### New algorithm RAMEN:
- **Architecture**: RAMEN handles compositional reasoning in both complex natural scenes and synthetic datasets through three steps: early fusion of visual and language features, learning bimodal embeddings, and recurrent aggregation of those embeddings.
- **Performance**: RAMEN performs strongly on multiple datasets, especially synthetic ones, demonstrating its advantage in cross-domain generalization.
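The three architectural steps named above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the layer sizes, the random weights standing in for learned parameters, and the plain tanh recurrence used as the aggregator are all assumptions chosen to keep the example short, and every function name is hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Random weights as a stand-in for learned parameters (assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def early_fusion(regions, question):
    """Step 1: concatenate the question vector onto every region feature."""
    return [region + question for region in regions]


def bimodal_embed(fused, layer):
    """Step 2: project each fused vector with a shared ReLU layer."""
    weights, bias = layer
    return [[max(0.0, v) for v in linear(x, weights, bias)] for x in fused]


def recurrent_aggregate(embeddings, input_layer, hidden_layer):
    """Step 3: fold the region embeddings into one vector.

    Uses a plain tanh recurrence purely for illustration; it is a
    simplified stand-in, not the aggregator used in the paper.
    """
    hidden = [0.0] * len(hidden_layer[1])
    for embedding in embeddings:
        from_input = linear(embedding, *input_layer)
        from_hidden = linear(hidden, *hidden_layer)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_hidden)]
    return hidden


# Toy sizes: 3 image regions of dimension 4, question vector of dimension 2.
rng = random.Random(0)
regions = [[rng.random() for _ in range(4)] for _ in range(3)]
question = [rng.random() for _ in range(2)]

fused = early_fusion(regions, question)
embeddings = bimodal_embed(fused, init_layer(rng, 6, 5))
summary = recurrent_aggregate(embeddings,
                              init_layer(rng, 5, 3),
                              init_layer(rng, 3, 3))
```

In this sketch `summary` is a single fixed-size vector regardless of how many regions the image has, which is what would let a downstream classifier score the answer vocabulary.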
### Experimental results:
- **Cross-dataset generalization**: RAMEN obtains the highest scores on TDIUC and CVQA, performs strongly on the other datasets, and has the highest average score.
- **Cross-task generalization**: On the TDIUC dataset, RAMEN performs best on multiple task types and generalizes especially well to rare answers.
- **Concept-combination generalization**: On the CVQA and CLEVR-CoGenT-B datasets, RAMEN shows a smaller performance decline when dealing with new concept combinations.
- **Counting and numerical comparison**: On the CLEVR dataset, RAMEN's performance in counting and numerical comparison tasks is second only to MAC, showing its strong ability in complex reasoning tasks.
In conclusion, this paper addresses the poor generalization of VQA algorithms between natural image datasets and synthetic datasets, and proposes a new algorithm, RAMEN, which performs strongly in both domains.