Abstract: In recent years, Visual Question Answering (VQA) has attracted increasing attention because it requires cross-modal understanding and reasoning over vision and language. The task is to automatically answer natural language questions with reference to a given image. VQA is challenging because reasoning in the visual domain requires a full understanding of the spatial relationships, semantic concepts, and common sense associated with a real image. However, most existing approaches jointly embed abstract low-level visual features and high-level question features to infer answers. Their reasoning ability is limited because they do not model the rich spatial context of regions, the high-level semantics of images, or knowledge across multiple sources. To address these challenges, we propose multi-source multi-level attention networks for VQA that benefit both spatial inference, through visual attention on context-aware region representations, and reasoning, through semantic attention on concepts and external knowledge. Specifically, we learn to reason on image representations by question-guided attention at different levels across multiple sources, including region-level and concept-level representations from the image source and sentence-level representations from an external knowledge base. First, we encode region-based middle-level outputs from Convolutional Neural Networks (CNNs) into spatially embedded representations with a multi-directional two-dimensional recurrent neural network, and then locate the answer-related regions with a multilayer perceptron as visual attention. Second, we generate semantic concepts from high-level semantics in CNNs and select the question-related concepts as concept attention. Third, we query semantic knowledge from a general knowledge base using the concepts and select the question-related knowledge as knowledge attention. Finally, we jointly optimize visual attention, concept attention, knowledge attention, and question embedding with a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach achieves significant improvement on two very challenging VQA datasets.
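
The question-guided attention described in the abstract can be illustrated with a minimal sketch. The abstract states only that a multilayer perceptron scores regions against the question; the additive form used here (tanh of projected region plus projected question, followed by a softmax over regions) is an assumption, as are all names (`question_guided_attention`, `W_v`, `W_q`, `w_out`) and the toy values. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def question_guided_attention(regions, question, W_v, W_q, w_out):
    """Score each region against the question with a one-hidden-layer MLP.

    Assumed (hypothetical) scoring form:
        score_i = w_out . tanh(W_v v_i + W_q q)
    Returns the attention weights and the attention-weighted region vector.
    """
    q_proj = matvec(W_q, question)
    scores = []
    for region in regions:
        v_proj = matvec(W_v, region)
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(w * h for w, h in zip(w_out, hidden)))
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [
        sum(wt * region[d] for wt, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, attended


# Toy inputs (made up for illustration): three 2-d region features, one question.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [1.0, 0.0]
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
w_out = [1.0, -1.0]

weights, attended = question_guided_attention(regions, question, W_v, W_q, w_out)
print(weights, attended)
```

Under the same assumption, the same routine could be applied to concept embeddings or knowledge-sentence embeddings in place of `regions`, with the three attended vectors and the question embedding then fed to a softmax classifier.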

Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering

Question-guided Feature Pyramid Network for Medical Visual Question Answering

Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering

Structural changes in mitochondria induced by uncoupling reagents. The response to snake-venom phospholipase A.

Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Medical visual question answering with symmetric interaction attention and cross-modal gating

VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in medical visual question answering

The frequency of antlered female and anterless male roe deer (Capreolus capreolus) in a population in south-east Norway

Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering

Self-supervised vision-language pretraining for Medical visual question answering

Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

A Bi-level representation learning model for medical visual question answering

MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering

Parallel multi-head attention and term-weighted question embedding for medical visual question answering

[A medical visual question answering approach based on co-attention networks]

Medical Visual Question Answering via Conditional Reasoning and Contrastive Learning

A Dual-Attention Learning Network With Word and Sentence Embedding for Medical Visual Question Answering

Multimodal fusion: advancing medical visual question-answering

Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering

Multi-source Multi-level Attention Networks for Visual Question Answering

Hierarchical deep multi-modal network for medical visual question answering