Abstract: Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. Specifically, we collect a new biomedical dataset named PMCPM which offers patient-based image-text pairs containing diverse patient situations from PubMed. Then, we pretrain the biomedical multi-modal model to learn visual and textual representation for image-text pairs and align these representations with image-text contrastive objective (ITC). Finally, we propose a retrieval-augmented method to better use the limited data. We propose to retrieve similar image-text pairs based on ITC from pretraining datasets and introduce a novel retrieval-attention module to fuse the representation of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on Med-VQA2019, Med-VQA2021, VQARAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM can enhance biomedical VQA performance compared with previous resources and methods. We will open-source our dataset, codes, and pretrained model.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses **the problem of insufficient data in the Biomedical Visual Question Answering (Biomedical VQA) task**. Compared with general-domain VQA, Biomedical VQA has limited training data, which makes models prone to overfitting and makes it hard for them to learn the domain knowledge needed to answer complex biomedical questions.

#### Main problems and solutions

1. **The problem of insufficient data**:
   - In the Biomedical VQA task, annotated image-text pairs are scarce, so the model tends to overfit during fine-tuning and struggles to learn sufficient domain knowledge.
2. **The proposed new method**:
   - **RAMM (Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training)**: To overcome the data shortage, the authors propose a retrieval-augmented multi-modal pre-training and fine-tuning paradigm.
   - **PMCPM dataset**: The authors construct a new large-scale, high-quality biomedical image-text pair dataset, PMCPM, which contains patient-related images and texts from PubMed Central and covers images of multiple modalities and conditions.
   - **Retrieval-augmentation mechanism**: Using image-text contrastive learning (ITC), the model retrieves similar image-text pairs from the pre-training datasets. A new retrieval-attention module then fuses the retrieved information with the representation of the image and the question, making better use of the limited data.

#### Specific steps

1. **Construct the PMCPM dataset**:
   - High-quality patient-related image-text pairs are selected from PubMed Central to build a large-scale dataset named PMCPM, which is larger and more diverse than the existing ROCO and MIMIC-CXR datasets.
2. **Multi-modal pre-training**:
   - The PMCPM, ROCO and MIMIC-CXR datasets are used for multi-modal pre-training with tasks such as masked language modeling (MLM), image-text contrastive learning (ITC) and image-text matching (ITM), in order to learn better visual and textual representations.
3. **Retrieval-augmented fine-tuning**:
   - In the fine-tuning stage, relevant image-text pairs are retrieved by ITC similarity, and the retrieved information is fused into the multi-modal encoder through the new retrieval-attention module, improving model performance.

#### Experimental results

- RAMM achieves state-of-the-art performance on multiple biomedical VQA datasets (Med-VQA2019, Med-VQA2021, VQARAD and SLAKE).
- Ablation experiments show that both the PMCPM dataset and the retrieval-augmentation mechanism contribute substantially to biomedical VQA performance.

### Summary

This paper tackles the problem of insufficient data in the Biomedical VQA task by constructing the large-scale PMCPM dataset and introducing a retrieval-augmentation mechanism, and it reports improved model performance on multiple benchmark datasets.
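The retrieval step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each pre-training pair has already been encoded into a fixed-length embedding by an ITC-trained encoder, and that "ITC similarity" can be approximated by cosine similarity between a query embedding and the stored embeddings. The function names (`retrieve_top_k`, `cosine`), the pair identifiers and the three-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def retrieve_top_k(query_emb, corpus, k=2):
    """Return the k (pair_id, score) entries most similar to the query.

    `corpus` is a list of (pair_id, embedding) tuples. In a real system the
    embeddings would come from the ITC-aligned encoders; here they are
    made-up toy values (assumption).
    """
    scored = [(cosine(query_emb, emb), pair_id) for pair_id, emb in corpus]
    scored.sort(key=lambda item: item[0], reverse=True)  # highest similarity first
    return [(pair_id, score) for score, pair_id in scored[:k]]


# Toy pre-training corpus with hypothetical identifiers and embeddings.
corpus = [
    ("chest_xray_a", [0.9, 0.1, 0.0]),
    ("brain_mri_b", [0.0, 0.2, 0.9]),
    ("chest_xray_c", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # embedding of the VQA input (hypothetical)

top = retrieve_top_k(query, corpus, k=2)
for pair_id, score in top:
    print(f"{pair_id}: {score:.3f}")
```

In this toy setup the two chest X-ray entries rank above the brain MRI entry because their embeddings point in nearly the same direction as the query. The retrieved pairs would then be passed to a fusion component such as the retrieval-attention module, whose internals are not specified in this summary and are therefore not sketched here.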

RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Self-supervised vision-language pretraining for Medical visual question answering

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering

Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training

BPI-MVQA: a bi-branch model for medical visual question answering

Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering

Medical visual question answering with symmetric interaction attention and cross-modal gating

MMBERT: Multimodal BERT Pretraining for Improved Medical VQA

Parallel multi-head attention and term-weighted question embedding for medical visual question answering

AMAM: An Attention-based Multimodal Alignment Model for Medical Visual Question Answering

MAPM: multiscale attention pre-training model for TextVQA

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents

MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

BrainMVP: Multi-modal Vision Pre-training for Brain Image Analysis using Multi-parametric MRI

Medical Vision-Language Pre-Training for Brain Abnormalities

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

Multimodal fusion: advancing medical visual question-answering