Abstract: This paper studies Visual Question Answering (VQA), a task that has recently attracted considerable attention in the multimedia community. We explore two critical research problems in VQA: (1) efficiently fusing the visual and textual modalities; and (2) giving VQA models the visual reasoning ability needed to answer complex questions. To address these problems, we propose novel Question Guided Modular Routing Networks (QGMRN). QGMRN is composed of a visual network, a textual network, and a routing network. The visual and textual networks serve as backbones, acting as generic feature extractors for their respective modalities, and QGMRN fuses the two modalities at multiple semantic levels. Visual reasoning is facilitated by the routing network, newly proposed in this paper, which selects modules in a discrete and stochastic way using the Gumbel-Softmax trick. When the input reaches a modular layer, the routing network dynamically selects a portion of the modules in that layer to process the input, depending on the question features generated by the textual network. The model thus learns to reason by routing between generic modules, without additional supervision or expert knowledge. Benefiting from this dynamic routing mechanism, QGMRN outperforms previous classical VQA methods by a large margin and achieves results competitive with state-of-the-art methods. Integrating an attention mechanism into QGMRN further boosts its performance. Extensive experiments on the CLEVR and CLEVR-Humans datasets validate the effectiveness of the proposed model and show that it achieves state-of-the-art performance.
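The question-guided routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-module keep/skip gate, the linear scoring of question features, the averaging of active module outputs, and all names (`gumbel_softmax`, `route`, `modular_layer`, `routing_weights`) are assumptions made for illustration. It shows only the forward pass; training would additionally need a gradient path through the discrete choice, such as a straight-through estimator.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over `logits` at temperature `tau`."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def route(question_features, routing_weights, tau=1.0, rng=random):
    """Pick a subset of modules: one 0/1 gate per module (assumed design)."""
    gates = []
    for weights in routing_weights:  # one hypothetical weight vector per module
        score = sum(w * q for w, q in zip(weights, question_features))
        # Two-way stochastic choice for this module: [keep, skip].
        keep, skip = gumbel_softmax([score, 0.0], tau, rng)
        gates.append(1 if keep > skip else 0)
    return gates


def modular_layer(x, modules, gates):
    """Average the outputs of the selected modules; pass x through if none."""
    active = [module(x) for module, gate in zip(modules, gates) if gate]
    if not active:
        return x
    return [sum(values) / len(active) for values in zip(*active)]
```

A call such as `modular_layer(x, modules, route(q, routing_weights))` then processes the visual input `x` with only the modules that the question features `q` selected. Lowering `tau` makes the relaxed sample closer to a hard one-hot choice.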
