Abstract: Collaborative reasoning for understanding each image-question pair is critical but under-explored for an interpretable Visual Question Answering (VQA) system. Although recent works have tried explicit compositional processes to assemble the multiple sub-tasks embedded in questions, their models rely heavily on annotations or hand-crafted rules to obtain valid reasoning layouts, leading to either heavy labor or poor performance on compositional reasoning. In this paper, to enable global context reasoning that better aligns the image and language domains in diverse and unrestricted cases, we propose a novel reasoning network called the Adversarial Composition Modular Network (ACMN). This network comprises two collaborative modules: i) an adversarial attention module that exploits the local visual evidence for each word parsed from the question; ii) a residual composition module that composes the previously mined evidence. Given a dependency parse tree for each question, the adversarial attention module progressively discovers the salient regions of one word by densely combining the regions of its child word nodes in an adversarial manner. The residual composition module then merges the hidden representations of an arbitrary number of children through sum pooling and a residual connection. Our ACMN is thus capable of building an interpretable VQA system that gradually dives into the image cues following a question-driven reasoning route and performs global reasoning by incorporating the learned knowledge of all attention modules in a principled manner. Experiments on relational datasets demonstrate the superiority of our ACMN, and visualization results show the explainable capability of our reasoning system.
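
The residual composition step described in the abstract (sum pooling over an arbitrary number of children, followed by a residual connection) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the per-word `feature` vector, and the `transform` function (an elementwise ReLU standing in for a learned layer) are all hypothetical names and simplifications introduced here, and the adversarial attention module is omitted entirely.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One word in a dependency parse tree (hypothetical structure)."""
    word: str
    feature: List[float]  # assumed per-word evidence vector
    children: List["Node"] = field(default_factory=list)


def sum_pool(vectors: List[List[float]], dim: int) -> List[float]:
    """Elementwise sum of any number of vectors; zeros if there are none."""
    pooled = [0.0] * dim
    for vec in vectors:
        for i, value in enumerate(vec):
            pooled[i] += value
    return pooled


def transform(vec: List[float]) -> List[float]:
    """Placeholder for a learned layer (assumption): elementwise ReLU."""
    return [max(0.0, value) for value in vec]


def compose(node: Node) -> List[float]:
    """Bottom-up composition: pool the children's hidden states, then add
    the pooled vector back to the transformed output (residual connection)."""
    dim = len(node.feature)
    child_hidden = [compose(child) for child in node.children]
    pooled = sum_pool(child_hidden, dim)
    combined = [f + p for f, p in zip(node.feature, pooled)]
    return [t + p for t, p in zip(transform(combined), pooled)]


# Toy tree: a root word with two child words (made-up feature values).
cube = Node("cube", [1.0, -2.0])
left = Node("left", [0.5, 0.5])
root = Node("is", [0.0, 1.0], children=[cube, left])
hidden = compose(root)
```

Because the children are sum-pooled before being combined with the parent, the same function handles leaves (no children) and nodes with any number of children without special cases.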

Structured Semantic Representation for Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

Graph-Structured Representations for Visual Question Answering

Focal and Composed Vision-semantic Modeling for Visual Question Answering

See and Learn More: Dense Caption-Aware Representation for Visual Question Answering

Visual Question Answering via Attention-based Syntactic Structure Tree-LSTM

Question-Guided Semantic Dual-Graph Visual Reasoning with Novel Answers

Visual Question Answering As Reading Comprehension

Compositional Memory for Visual Question Answering

DSGEM: Dual Scene Graph Enhancement Module‐based Visual Question Answering

Semantic-Aware Modular Capsule Routing for Visual Question Answering

Syntax Tree Constrained Graph Network for Visual Question Answering

Video Question Answering with Semantic Disentanglement and Reasoning

Visual Question Answering Via Combining Inferential Attention and Semantic Space Mapping

Transformer-based Sparse Encoder and Answer Decoder for Visual Question Answering

An effective spatial relational reasoning networks for visual question answering

Visual Question Reasoning on General Dependency Tree