Improved RAMEN: Towards Domain Generalization for Visual Question Answering

Bhanuka Manesha Samarasekara Vitharana Gamage,Lim Chern Hong
DOI: https://doi.org/10.48550/arXiv.2109.02370
2021-09-06
Abstract: Currently nearing human-level performance, Visual Question Answering (VQA) is an emerging area in artificial intelligence. Established as a multi-disciplinary field in machine learning, both computer vision and natural language processing communities are working together to achieve state-of-the-art (SOTA) performance. However, there is a gap between the SOTA results and real-world applications. This is due to the lack of model generalisation. The RAMEN model (Shrestha et al., 2019) aimed to achieve domain generalization by obtaining the highest score across two main types of VQA datasets. This study provides two major improvements to the early/late fusion module and aggregation module of the RAMEN architecture, with the objective of further strengthening domain generalization. Vector-operation-based fusion strategies are introduced for the fusion module and the transformer architecture is introduced for the aggregation module. The experiments show improvements on up to five VQA datasets. Following the results, this study analyses the effects of both improvements on the domain generalization problem. The code is available on GitHub through the following link: https://github.com/bhanukaManesha/ramen
Computer Vision and Pattern Recognition,Machine Learning
### What problems does this paper attempt to solve?

This paper aims to solve the **domain generalization problem** in Visual Question Answering (VQA). Current VQA models perform close to the human level on standard datasets, but struggle to maintain this performance in practical applications. Existing models are usually trained and evaluated on specific types of VQA datasets, so they do not generalize well to unseen data or to different types of VQA tasks.

#### Main problems:

1. **The gap between SOTA results and real-world applications**: Although VQA models have achieved good results on some datasets, these results do not translate directly into good performance in the real world.
2. **Domain generalization problem**: VQA datasets fall into two categories: question answering over natural images, and reasoning tests over synthetic images. Existing models often perform well on only one category, not on both simultaneously.
3. **Model overfitting**: Many models are overfitted to specific datasets, so they perform poorly on other types of datasets.

To address these problems, the authors improve the previously proposed RAMEN model, with a focus on enhancing its domain generalization ability. The specific improvements are:

- **Improved fusion module**: A fusion strategy based on vector operations is introduced for the early-fusion and late-fusion modules.
- **Improved aggregation module**: The Transformer architecture replaces the original bidirectional GRU network to better capture the relationships between bimodal embeddings.

Through these improvements, the authors aim to further improve the generalization ability of VQA models across different datasets, thereby narrowing the gap between laboratory settings and practical applications.
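The two improvements can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the text here does not say which vector operations the paper uses, so the `concat`, `add`, and `multiply` modes are assumptions, and the function names `fuse` and `self_attention_aggregate` are hypothetical. The aggregation step uses plain scaled dot-product self-attention with no learned weights, followed by mean pooling, as a stand-in for a full Transformer encoder.

```python
import math


def fuse(image_vec, question_vec, mode="concat"):
    """Combine one image-region vector with a question vector.

    The three modes are illustrative assumptions, not the paper's list.
    """
    if mode == "concat":
        return list(image_vec) + list(question_vec)
    if len(image_vec) != len(question_vec):
        raise ValueError("element-wise fusion needs equal-length vectors")
    if mode == "add":
        return [a + b for a, b in zip(image_vec, question_vec)]
    if mode == "multiply":
        return [a * b for a, b in zip(image_vec, question_vec)]
    raise ValueError(f"unknown fusion mode: {mode}")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_aggregate(embeddings):
    """Aggregate a set of bimodal embeddings into one vector.

    Each embedding attends to all others (scaled dot-product attention
    without learned projections), then the results are mean-pooled.
    """
    dim = len(embeddings[0])
    attended = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, embeddings))
             for i in range(dim)]
        )
    count = len(attended)
    return [sum(row[i] for row in attended) / count for i in range(dim)]


if __name__ == "__main__":
    question = [0.5, 1.0]
    regions = [[1.0, 2.0], [3.0, 4.0]]
    bimodal = [fuse(region, question, mode="multiply") for region in regions]
    print(self_attention_aggregate(bimodal))
```

Unlike a bidirectional GRU, which reads the region embeddings in a fixed sequence, the attention step lets every embedding interact with every other one directly. A real model would add learned query, key, and value projections, multiple heads, and feed-forward layers.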
### Summary

The main goal of this paper is to improve the domain generalization ability of VQA models by improving the fusion and aggregation modules of the RAMEN model, so that they perform more stably and consistently across multiple types of VQA datasets.