Improved RAMEN: Towards Domain Generalization for Visual Question Answering

Bhanuka Manesha Samarasekara Vitharana Gamage, Lim Chern Hong

DOI: https://doi.org/10.48550/arXiv.2109.02370

2021-09-06

Abstract: Currently nearing human-level performance, Visual Question Answering (VQA) is an emerging area in artificial intelligence. Established as a multi-disciplinary field in machine learning, it brings the computer vision and natural language processing communities together to achieve state-of-the-art (SOTA) performance. However, there is a gap between the SOTA results and real-world applications, which is due to the lack of model generalisation. The RAMEN model \cite{Shrestha2019} aimed to achieve domain generalization by obtaining the highest score across two main types of VQA datasets. This study provides two major improvements to the early/late fusion module and aggregation module of the RAMEN architecture, with the objective of further strengthening domain generalization. Vector-operation-based fusion strategies are introduced for the fusion module and the transformer architecture is introduced for the aggregation module. The experiments conducted show improvements on up to five VQA datasets. Following the results, this study analyses the effects of both improvements on the domain generalization problem. The code is available on GitHub at https://github.com/bhanukaManesha/ramen.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper aims to solve the **domain generalization problem** in Visual Question Answering (VQA). Current VQA models perform close to human level on standard datasets, but they struggle to maintain this performance in practical applications. Existing models are usually trained and evaluated on specific types of VQA datasets, so they do not generalize well to unseen data or to different types of VQA tasks.

#### Main problems:

1. **The gap between SOTA results and real-world applications**: Although VQA models have achieved good results on some datasets, these results do not translate directly into good performance in the real world.
2. **Domain generalization problem**: VQA datasets fall into two categories: question answering on natural images, and reasoning tests on synthetic images. Existing models often perform well on only one category rather than on both.
3. **Model overfitting**: Many models are overfitted to specific datasets and therefore perform poorly on other types of datasets.

To solve these problems, the authors improved the previously proposed RAMEN model, focusing on its domain generalization ability. Specific improvements include:

- **Improved fusion module**: A fusion strategy based on vector operations was introduced for the early-fusion and late-fusion modules.
- **Improved aggregation module**: The Transformer architecture was introduced to replace the original bidirectional GRU network, to better capture the relationships between bimodal embeddings.

Through these improvements, the authors hope to further improve the generalization ability of VQA models across different datasets, thereby narrowing the gap between the laboratory environment and practical applications.
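The fusion change described above can be illustrated with a minimal sketch. The text here does not say which vector operations the paper uses, so element-wise addition and multiplication are assumed examples, shown next to plain concatenation for contrast. All function names (`fuse_concat`, `fuse_add`, `fuse_multiply`, `early_fusion`) are hypothetical and are not taken from the RAMEN code base.

```python
from typing import Callable, List

Vector = List[float]


def fuse_concat(region: Vector, question: Vector) -> Vector:
    """Concatenation: keeps both vectors side by side (dimensions add up)."""
    return region + question


def fuse_add(region: Vector, question: Vector) -> Vector:
    """Element-wise addition (assumed example of a vector operation)."""
    if len(region) != len(question):
        raise ValueError("addition needs vectors of equal length")
    return [r + q for r, q in zip(region, question)]


def fuse_multiply(region: Vector, question: Vector) -> Vector:
    """Element-wise multiplication (assumed example of a vector operation)."""
    if len(region) != len(question):
        raise ValueError("multiplication needs vectors of equal length")
    return [r * q for r, q in zip(region, question)]


def early_fusion(
    regions: List[Vector],
    question: Vector,
    op: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """Fuse one question embedding with every image-region feature."""
    return [op(region, question) for region in regions]


# Toy data: two image regions and one question embedding, all 2-dimensional.
regions = [[1.0, 2.0], [3.0, 4.0]]
question = [0.5, 2.0]

fused_cat = early_fusion(regions, question, fuse_concat)
fused_add = early_fusion(regions, question, fuse_add)
fused_mul = early_fusion(regions, question, fuse_multiply)
```

One practical difference is that concatenation doubles the embedding size here, while the element-wise operations keep it unchanged but require both modalities to be projected to the same dimension first.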
### Summary

The main goal of this paper is to improve the domain generalization ability of VQA models by improving the fusion and aggregation modules of the RAMEN model, so that the model performs more stably and consistently across multiple types of VQA datasets.
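To make the aggregation change more concrete, the sketch below implements single-head scaled dot-product self-attention, the core operation of a Transformer encoder layer, followed by mean pooling into one vector. It is an illustration under stated assumptions, not the paper's implementation: learned query/key/value projections, multiple heads, feed-forward sublayers and layer normalization are all omitted, and the pooling step and the names `self_attention` and `aggregate` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(embeddings: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity projections.

    Assumption: no learned weights; each embedding serves as its own
    query, key and value.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    attended = []
    for query in embeddings:
        weights = softmax(
            [dot(query, key) / math.sqrt(dim) for key in embeddings]
        )
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, embeddings))
             for i in range(dim)]
        )
    return attended


def aggregate(embeddings: List[Vector]) -> Vector:
    """Attend over all bimodal embeddings, then mean-pool to one vector."""
    attended = self_attention(embeddings)
    count = len(attended)
    return [sum(v[i] for v in attended) / count
            for i in range(len(attended[0]))]


# Toy bimodal embeddings, one per image region.
bimodal = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = aggregate(bimodal)
pooled_reversed = aggregate(bimodal[::-1])
```

In this simplified form, with no positional encoding, the pooled vector does not depend on the order of the regions, whereas a recurrent aggregator such as a GRU processes them as a sequence. Whether the paper's Transformer aggregator uses positional encodings is not stated here.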

Simple and Effective Visual Question Answering in a Single Modality

Information Fusion in Visual Question Answering: A Survey

Context-aware Multi-level Question Embedding Fusion for visual question answering

RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

Modular dual-stream visual fusion network for visual question answering

Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration

Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion

Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering

Question Guided Modular Routing Networks for Visual Question Answering

Transformer Module Networks for Systematic Generalization in Visual Question Answering

A lightweight Transformer-based visual question answering network with Weight-Sharing Hybrid Attention

Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

Multi-Modality Global Fusion Attention Network for Visual Question Answering

Enhancing machine vision: the impact of a novel innovative technology on video question-answering

Advanced Visual and Textual Co-context Aware Attention Network with Dependent Multimodal Fusion Block for Visual Question Answering

Relational reasoning and adaptive fusion for visual question answering

Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering

MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network