Abstract: Diagram Question Answering (DQA) aims to correctly answer questions about given diagrams, which demands both good diagram understanding and effective reasoning. However, objects with the same appearance can express different semantics in different diagrams. This visual semantic ambiguity makes it difficult to learn diagram representations that are sufficient for understanding. Moreover, since questions can address a diagram from different perspectives, it is also crucial to reason flexibly and adaptively over content-rich diagrams. In this paper, we propose a Disentangled Adaptive Visual Reasoning Network for DQA, named DisAVR, to jointly optimize the dual process of representation and reasoning. DisAVR comprises three main modules: improved region feature learning, question parsing, and disentangled adaptive reasoning. Specifically, the improved region feature learning module first learns a robust diagram representation by integrating detail-aware patch features and semantically explicit text features with region features. Subsequently, the question parsing module decomposes the question into three types of question guidance (region, spatial relation, and semantic relation guidance) to dynamically guide the subsequent reasoning. Next, the disentangled adaptive reasoning module decomposes the whole reasoning process by employing three visual reasoning cells to construct a soft, fully connected, multi-layer stacked routing space. The three cells in each layer reason over the object regions, semantic relations, and spatial relations in the diagram under the corresponding question guidance. In addition, an adaptive routing mechanism is designed to flexibly explore better reasoning paths for specific diagram-question pairs. Extensive experiments on three DQA datasets demonstrate the superiority of DisAVR.
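The soft, fully connected routing space described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how routing weights are computed, so the sketch assumes a softmax over per-cell routing logits, and the names `softmax`, `route_layer`, `run_stack`, and the placeholder cells are hypothetical. In a real model the logits would presumably be predicted from the diagram-question pair and the cells would be learned modules.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_layer(prev_outputs, routing_logits, cells):
    """Run one layer of a soft, fully connected routing space.

    prev_outputs: one feature vector per cell of the previous layer.
    routing_logits[j][i]: hypothetical score for routing the output of
        previous cell i into current cell j (assumed given here).
    cells: one callable per reasoning cell (placeholders in this sketch).
    """
    dim = len(prev_outputs[0])
    outputs = []
    for j, cell in enumerate(cells):
        weights = softmax(routing_logits[j])
        # Soft routing: each cell sees a weighted mix of all previous outputs.
        mixed = [
            sum(w * out[d] for w, out in zip(weights, prev_outputs))
            for d in range(dim)
        ]
        outputs.append(cell(mixed))
    return outputs


def run_stack(initial_outputs, logits_per_layer, cells):
    """Apply route_layer once per layer, reusing the same three cell types."""
    outputs = initial_outputs
    for routing_logits in logits_per_layer:
        outputs = route_layer(outputs, routing_logits, cells)
    return outputs


def identity_cell(vector):
    """Placeholder for a region, spatial, or semantic reasoning cell."""
    return list(vector)


if __name__ == "__main__":
    # Three cells standing in for region, spatial and semantic reasoning.
    cells = [identity_cell, identity_cell, identity_cell]
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    uniform = [[0.0, 0.0, 0.0] for _ in cells]
    print(run_stack(features, [uniform, uniform], cells))
```

With uniform logits every cell receives the plain average of the previous layer's outputs; sharply peaked logits approximate a hard choice of a single reasoning path, which is the sense in which the routing is "soft".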

DIEM: Decomposition-Integration Enhancing Multimodal Insights

Simple and Effective Visual Question Answering in a Single Modality

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Visual Question Decomposition on Multimodal Large Language Models

Visual Question Answering With Dense Inter- and Intra-Modality Interactions

Medical visual question answering with symmetric interaction attention and cross-modal gating

DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations

UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering

Context-aware Multi-level Question Embedding Fusion for visual question answering

DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering

MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

Multitask Learning for Visual Question Answering

AMAM: An Attention-based Multimodal Alignment Model for Medical Visual Question Answering

Spontaneous regression of orbital Langerhans cell granulomatosis in a three-year-old girl.

DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception

Detached and Interactive Multimodal Learning

Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering

Tackling Vision Language Tasks Through Learning Inner Monologues

RSMoDM: Multimodal Momentum Distillation Model for Remote Sensing Visual Question Answering

Structural changes in mitochondria induced by uncoupling reagents. The response to snake-venom phospholipase A.

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models