Abstract: To contribute to automating the medical vision-language model, we propose a novel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, this task attempts to answer several questions on both diseases and, more importantly, the differences between them. This is consistent with the radiologist's diagnosis practice that compares the current image with the reference before concluding the report. We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images. Compared to existing medical VQA datasets, our questions are tailored to the Assessment-Diagnosis-Intervention-Evaluation treatment procedure used by clinical professionals. Meanwhile, we also propose a novel expert knowledge-aware graph representation learning model to address this task. The proposed baseline model leverages expert knowledge such as anatomical structure prior, semantic, and spatial knowledge to construct a multi-relationship graph, representing the image differences between two images for the image difference VQA task. The dataset and code can be found at https://github.com/Holipori/MIMIC-Diff-VQA. We believe this work would further push forward the medical vision language model.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address a new task in medical visual language models: Chest X-ray Difference Visual Question Answering (VQA). Specifically, given a pair of primary and reference images, this task attempts to answer multiple questions about diseases and their differences. This aligns with the practice of radiologists who typically compare the current image with a reference image to assess disease progression before writing a report.

### Background and Motivation

1. **Limitations of Existing Datasets**:
   - Existing medical VQA datasets (e.g., ImageCLEF-VQA-Med) have overly simplistic questions and lack consideration of heterogeneity and subjectivity.
   - The types of questions in these datasets are limited and do not provide sufficient clinical information.
2. **Needs of Clinical Practice**:
   - In actual clinical practice, radiologists assess disease progression by comparing current and past images of the same patient.
   - The clinical diagnostic process includes Assessment, Diagnosis, Intervention, and Evaluation, requiring a system that can support this process.

### Proposed Method

1. **New Dataset**:
   - The authors constructed a new dataset, MIMIC-Diff-VQA, containing 700,703 question-answer pairs from 164,324 pairs of primary and reference images.
   - The types of questions include abnormality, presence, view, location, type, level, and difference, with "difference" questions specifically guiding the model to focus on and locate important areas.
2. **Expert Knowledge-Aware Image Difference Graph Representation Learning Model**:
   - This model uses anatomical structure priors, semantic, and spatial knowledge to construct a multi-relation graph representing the differences between two images.
   - Image features are extracted from different anatomical structures and represented as nodes in the graph.
   - Three types of relationships are constructed: spatial relationships, semantic relationships, and implicit relationships, based on the spatial distance of anatomical structures, the relationships between diseases and anatomical structures, and potential implicit relationships, respectively.

### Main Contributions

1. **Proposed and constructed the first large-scale medical image difference VQA dataset, MIMIC-Diff-VQA**.
2. **Proposed an anatomical structure-aware image difference model** that can extract image difference features related to disease progression and intervention.
3. **Developed a multi-relation image difference graph feature representation learning method**, utilizing spatial and semantic relationships (extracted from expert knowledge graphs) to compute image difference graph feature representations, generate answers, and explain how the answers are generated.

### Experimental Results

1. **Experimental Setup**:
   - Experiments were implemented on the PyTorch platform, using the Adam optimizer, with a learning rate of 0.0001, trained for 30,000 iterations, and a batch size of 64.
   - Experiments were conducted on two GeForce RTX 3090 GPUs, with a training time of 3 hours and 49 minutes.
2. **Evaluation Metrics**:
   - Common text generation evaluation metrics such as BLEU, METEOR, ROUGE_L, and CIDEr were used.
   - For comparison with MMQ, accuracy was used as the evaluation metric.
3. **Ablation Studies**:
   - Ablation studies were conducted with different graph structures to verify the effectiveness of each component.

### Conclusion

This paper proposes a new medical image difference VQA task and constructs the corresponding dataset, MIMIC-Diff-VQA. By introducing an expert knowledge-aware image difference graph representation learning model, this method achieves good performance in complex medical image difference tasks, providing a new direction for the development of medical visual language models.
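As a rough illustration of the multi-relation graph described under Proposed Method, the sketch below builds a toy difference graph in plain Python. It is not the authors' implementation. The `Region` class, the distance threshold `max_dist`, the `finding_to_structures` table, the bounding boxes, and the feature values are all hypothetical. Node features are simplified to an element-wise main-minus-reference subtraction, semantic links are simplified to "structures that share a finding", and the implicit relation is modeled as a fully connected graph whose useful links a model would have to learn.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot


@dataclass
class Region:
    """One anatomical structure: a bounding box plus a feature vector."""
    name: str
    box: tuple      # (x_min, y_min, x_max, y_max), hypothetical pixel coordinates
    feature: list   # hypothetical visual feature for this structure


def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def spatial_edges(regions, max_dist):
    """Link structures whose box centers lie within max_dist (assumed rule)."""
    edges = set()
    for i, j in combinations(range(len(regions)), 2):
        (ax, ay), (bx, by) = center(regions[i].box), center(regions[j].box)
        if hypot(ax - bx, ay - by) <= max_dist:
            edges.add((i, j))
    return edges


def semantic_edges(regions, finding_to_structures):
    """Link structures that a knowledge table associates with the same finding."""
    index = {r.name: i for i, r in enumerate(regions)}
    edges = set()
    for structures in finding_to_structures.values():
        ids = sorted({index[s] for s in structures if s in index})
        edges.update(combinations(ids, 2))
    return edges


def implicit_edges(regions):
    """Fully connected graph; a model would learn which links matter."""
    return set(combinations(range(len(regions)), 2))


def difference_graph(main, reference, max_dist, finding_to_structures):
    """Nodes hold main-minus-reference features; edges come in three relation types."""
    nodes = {
        m.name: [a - b for a, b in zip(m.feature, r.feature)]
        for m, r in zip(main, reference)
    }
    return {
        "nodes": nodes,
        "spatial": spatial_edges(main, max_dist),
        "semantic": semantic_edges(main, finding_to_structures),
        "implicit": implicit_edges(main),
    }


# Toy main/reference pair with made-up boxes and features.
main = [
    Region("left lung", (300, 100, 500, 400), [0.75, 0.25]),
    Region("right lung", (50, 100, 250, 400), [0.5, 0.25]),
    Region("cardiac silhouette", (200, 250, 380, 420), [0.25, 0.5]),
]
reference = [
    Region("left lung", (300, 100, 500, 400), [0.25, 0.25]),
    Region("right lung", (50, 100, 250, 400), [0.5, 0.25]),
    Region("cardiac silhouette", (200, 250, 380, 420), [0.25, 0.25]),
]
knowledge = {"pleural effusion": ["left lung", "right lung"]}  # hypothetical table

graph = difference_graph(main, reference, max_dist=150,
                         finding_to_structures=knowledge)
```

In this toy pair only the left lung and the cardiac silhouette change between the two images, so only their node features are non-zero. Keeping the three edge sets separate mirrors the idea of treating spatial, semantic, and implicit relationships as distinct relation types over the same nodes.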

Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering
