Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering

Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M. Summers, Yingying Zhu
DOI: https://doi.org/10.1145/3580305.3599819
2024-08-28
Abstract: To contribute to automating the medical vision-language model, we propose a novel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, this task attempts to answer several questions on both diseases and, more importantly, the differences between them. This is consistent with the radiologist's diagnosis practice that compares the current image with the reference before concluding the report. We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images. Compared to existing medical VQA datasets, our questions are tailored to the Assessment-Diagnosis-Intervention-Evaluation treatment procedure used by clinical professionals. Meanwhile, we also propose a novel expert knowledge-aware graph representation learning model to address this task. The proposed baseline model leverages expert knowledge such as anatomical structure prior, semantic, and spatial knowledge to construct a multi-relationship graph, representing the image differences between two images for the image difference VQA task. The dataset and code can be found at https://github.com/Holipori/MIMIC-Diff-VQA. We believe this work would further push forward the medical vision language model.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address a new task in medical visual language models: Chest X-ray Difference Visual Question Answering (VQA). Specifically, given a pair of primary and reference images, this task attempts to answer multiple questions about diseases and their differences. This aligns with the practice of radiologists who typically compare the current image with a reference image to assess disease progression before writing a report.

### Background and Motivation

1. **Limitations of Existing Datasets**:
   - Existing medical VQA datasets (e.g., ImageCLEF-VQA-Med) have overly simplistic questions and lack consideration of heterogeneity and subjectivity.
   - The types of questions in these datasets are limited and do not provide sufficient clinical information.
2. **Needs of Clinical Practice**:
   - In actual clinical practice, radiologists assess disease progression by comparing current and past images of the same patient.
   - The clinical diagnostic process includes Assessment, Diagnosis, Intervention, and Evaluation, requiring a system that can support this process.

### Proposed Method

1. **New Dataset**:
   - The authors constructed a new dataset, MIMIC-Diff-VQA, containing 700,703 question-answer pairs from 164,324 pairs of primary and reference images.
   - The types of questions include abnormality, presence, view, location, type, level, and difference, with "difference" questions specifically guiding the model to focus on and locate important areas.
2. **Expert Knowledge-Aware Image Difference Graph Representation Learning Model**:
   - This model uses anatomical structure priors, semantic, and spatial knowledge to construct a multi-relation graph representing the differences between two images.
   - Image features are extracted from different anatomical structures and represented as nodes in the graph.
   - Three types of relationships are constructed: spatial relationships, semantic relationships, and implicit relationships, based on the spatial distance of anatomical structures, the relationships between diseases and anatomical structures, and potential implicit relationships, respectively.

### Main Contributions

1. **Proposed and constructed the first large-scale medical image difference VQA dataset, MIMIC-Diff-VQA**.
2. **Proposed an anatomical structure-aware image difference model** that can extract image difference features related to disease progression and intervention.
3. **Developed a multi-relation image difference graph feature representation learning method**, utilizing spatial and semantic relationships (extracted from expert knowledge graphs) to compute image difference graph feature representations, generate answers, and explain how the answers are generated.

### Experimental Results

1. **Experimental Setup**:
   - Experiments were implemented on the PyTorch platform, using the Adam optimizer, with a learning rate of 0.0001, trained for 30,000 iterations, and a batch size of 64.
   - Experiments were conducted on two GeForce RTX 3090 GPUs, with a training time of 3 hours and 49 minutes.
2. **Evaluation Metrics**:
   - Common text generation evaluation metrics such as BLEU, METEOR, ROUGE_L, and CIDEr were used.
   - For comparison with MMQ, accuracy was used as the evaluation metric.
3. **Ablation Studies**:
   - Ablation studies were conducted with different graph structures to verify the effectiveness of each component.

### Conclusion

This paper proposes a new medical image difference VQA task and constructs the corresponding dataset, MIMIC-Diff-VQA. By introducing an expert knowledge-aware image difference graph representation learning model, this method achieves good performance in complex medical image difference tasks, providing a new direction for the development of medical visual language models.
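
### Illustrative Sketch of the Multi-Relation Graph

To make the graph construction described under Proposed Method more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each anatomical structure comes with a bounding box and one feature vector per image. The `Region` container, the function names, the distance threshold `max_dist`, and the `related_names` lookup table are all hypothetical. Taking the node difference as main features minus reference features, and treating implicit relations as a fully connected graph, are likewise assumptions made only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # undirected edge stored as (smaller index, larger index)


@dataclass
class Region:
    """One anatomical structure seen in both images (hypothetical container)."""
    name: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    main_feat: List[float]                  # features from the main image
    ref_feat: List[float]                   # features from the reference image


def box_center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def difference_features(regions: List[Region]) -> List[List[float]]:
    # Assumption: the per-node difference is main minus reference.
    return [[m - r for m, r in zip(reg.main_feat, reg.ref_feat)]
            for reg in regions]


def spatial_edges(regions: List[Region], max_dist: float) -> Set[Edge]:
    # Connect two structures when their box centers lie within max_dist.
    edges: Set[Edge] = set()
    for (i, a), (j, b) in combinations(enumerate(regions), 2):
        (ax, ay), (bx, by) = box_center(a.box), box_center(b.box)
        if hypot(ax - bx, ay - by) <= max_dist:
            edges.add((i, j))
    return edges


def semantic_edges(regions: List[Region],
                   related_names: Set[Tuple[str, str]]) -> Set[Edge]:
    # related_names stands in for an expert knowledge table of related nodes.
    index = {reg.name: i for i, reg in enumerate(regions)}
    edges: Set[Edge] = set()
    for a, b in related_names:
        if a in index and b in index and index[a] != index[b]:
            i, j = sorted((index[a], index[b]))
            edges.add((i, j))
    return edges


def implicit_edges(num_nodes: int) -> Set[Edge]:
    # Assumption: implicit relations are modeled as a fully connected graph.
    return set(combinations(range(num_nodes), 2))
```

Keeping the three edge sets separate lets a downstream model treat each relation type differently. How the paper's model combines them is not specified in this summary.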