Abstract: Medical Visual Question Answering (VQA) systems play a supporting role in understanding the clinically relevant information carried by medical images. Questions about a medical image fall into two categories: close-end (such as Yes/No questions) and open-end. To obtain answers, the majority of existing medical VQA methods rely on classification approaches, while a few works attempt to use generation approaches or a mixture of the two. The classification approaches are relatively simple but perform poorly on long open-end questions. To bridge this gap, we propose a new Transformer-based framework for medical VQA, named Q2ATransformer, which integrates the advantages of both the classification and the generation approaches and provides a unified treatment of close-end and open-end questions. Specifically, we introduce an additional Transformer decoder with a set of learnable candidate answer embeddings to query the existence of each answer class for a given image-question pair. Through the Transformer attention, the candidate answer embeddings interact with the fused features of the image-question pair to make the decision. In this way, despite being a classification-based approach, our method provides a mechanism to interact with the answer information for prediction, as generation-based approaches do. At the same time, by using classification, we reduce the task difficulty by shrinking the search space of answers. Our method achieves new state-of-the-art performance on two medical VQA benchmarks. In particular, for open-end questions, we achieve 79.19% on VQA-RAD and 54.85% on PathVQA, with absolute improvements of 16.09% and 41.45%, respectively.
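
The answer-querying idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes a single cross-attention head with no learned projections, a residual connection, and a per-answer sigmoid over a shared linear scorer, none of which the abstract specifies. All names (`answer_query_scores`, `fused_features`, `w_out`) and the random toy inputs are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def answer_query_scores(answer_embeddings, fused_features, w_out, b_out):
    """Score each candidate answer by letting its embedding attend to the
    fused image-question features (single-head cross-attention).

    Assumption: a sigmoid per answer class models "does this answer apply?",
    which is one plausible reading of "query the existence of each answer".
    """
    dim = len(fused_features[0])
    scale = math.sqrt(dim)
    scores = []
    for query in answer_embeddings:
        # Attention weights of this answer query over the fused tokens.
        weights = softmax([dot(query, tok) / scale for tok in fused_features])
        # Weighted sum of the fused tokens (keys double as values here).
        attended = [
            sum(w * tok[d] for w, tok in zip(weights, fused_features))
            for d in range(dim)
        ]
        # Residual connection, then a linear scorer shared by all answers.
        hidden = [q + a for q, a in zip(query, attended)]
        logit = dot(hidden, w_out) + b_out
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Toy inputs: random vectors stand in for learned embeddings and features.
rng = random.Random(0)
dim, num_tokens, num_answers = 8, 5, 4
fused_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
answer_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_answers)]
w_out = [rng.gauss(0, 1) for _ in range(dim)]

scores = answer_query_scores(answer_embeddings, fused_features, w_out, 0.0)
predicted = max(range(num_answers), key=scores.__getitem__)
print(scores, predicted)
```

In this sketch the answer set is fixed in advance, which is what keeps the approach a classification over a bounded search space, while each candidate answer still interacts with the image-question features through attention.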

A Question-Answering Approach to Key Value Pair Extraction from Form-like Document Images

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Entity-Relation Extraction As Multi-Turn Question Answering

Information Fusion in Visual Question Answering: A Survey

Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering

Co-Attending Free-Form Regions and Detections With Multi-Modal Multiplicative Feature Embedding for Visual Question Answering

Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder

Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

Information Extraction from Documents: Question Answering vs Token Classification in real-world setups

Task-driven Visual Saliency and Attention-based Visual Question Answering

Question Answering With Character-Level Lstm Encoders And Model-Based Data Augmentation

Spontaneous regression of orbital Langerhans cell granulomatosis in a three-year-old girl.

Positional Attention Guided Transformer-Like Architecture for Visual Question Answering

Using Context Information to Enhance Simple Question Answering

Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources

Context-aware Multi-level Question Embedding Fusion for visual question answering

Text-based Visual Question Answering with Knowledge Base.

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection