RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

Zheng Yuan, Qiao Jin, Chuanqi Tan, Zhengyun Zhao, Hongyi Yuan, Fei Huang, Songfang Huang
2023-03-01
Abstract: Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. Specifically, we collect a new biomedical dataset named PMCPM which offers patient-based image-text pairs containing diverse patient situations from PubMed. Then, we pretrain the biomedical multi-modal model to learn visual and textual representations for image-text pairs and align these representations with an image-text contrastive objective (ITC). Finally, we propose a retrieval-augmented method to better use the limited data. We propose to retrieve similar image-text pairs based on ITC from pretraining datasets and introduce a novel retrieval-attention module to fuse the representation of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on Med-VQA2019, Med-VQA2021, VQARAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM can enhance biomedical VQA performance compared with previous resources and methods. We will open-source our dataset, codes, and pretrained model.
Computer Vision and Pattern Recognition, Computation and Language
### What problem does this paper attempt to solve?

This paper aims to solve **the problem of insufficient data in the Biomedical Visual Question Answering (Biomedical VQA) task**. Compared with general-domain VQA, Biomedical VQA has limited training data, which makes models prone to overfitting and makes it hard for them to learn the domain knowledge needed to answer complex biomedical questions.

#### Main problems and solutions

1. **The problem of insufficient data**:
   - In the Biomedical VQA task, annotated image-text pairs are very limited, so the model tends to overfit during fine-tuning and struggles to learn sufficient domain knowledge.
2. **The proposed new method**:
   - **RAMM (Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training)**: To overcome the problem of insufficient data, the authors propose a retrieval-augmented multi-modal pre-training and fine-tuning paradigm.
   - **PMCPM dataset**: The authors construct a new large-scale, high-quality dataset of biomedical image-text pairs, PMCPM, which contains patient-related images and texts from PubMed Central and covers multiple imaging modalities and conditions.
   - **Retrieval-enhancement mechanism**: Using image-text contrastive learning (ITC), the model retrieves similar image-text pairs from the pre-training datasets, and a new retrieval-attention module fuses the retrieved information with the input image and question to make better use of the limited data.

#### Specific steps

1. **Construct the PMCPM dataset**:
   - High-quality patient-related image-text pairs are filtered from PubMed Central to build a large-scale dataset named PMCPM, which is larger and more diverse than the existing ROCO and MIMIC-CXR datasets.
2. **Multi-modal pre-training**:
   - The PMCPM, ROCO and MIMIC-CXR datasets are used for multi-modal pre-training with masked language modeling (MLM), image-text contrastive learning (ITC) and image-text matching (ITM) objectives, in order to learn better visual and textual representations.
3. **Retrieval-enhanced fine-tuning**:
   - In the fine-tuning stage, relevant image-text pairs are retrieved by ITC similarity, and the retrieved information is fused into the multi-modal encoder by the new retrieval-attention module, which improves the performance of the model.

#### Experimental results

- RAMM achieves state-of-the-art performance on multiple biomedical VQA datasets (Med-VQA2019, Med-VQA2021, VQARAD and SLAKE).
- Ablation experiments show that both the PMCPM dataset and the retrieval-enhancement mechanism improve performance on the biomedical VQA task.

### Summary

This paper addresses the problem of insufficient data in the Biomedical VQA task by constructing the large-scale PMCPM dataset and introducing a retrieval-enhancement mechanism, and it reports state-of-the-art performance on multiple benchmark datasets.
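
To make the retrieval-enhanced fine-tuning idea more concrete, the following is a minimal, illustrative sketch of its two ingredients: ranking stored embeddings by cosine similarity to a query (standing in for ITC similarity) and fusing the top matches back into the query with a single attention step. It is not the authors' implementation. The function names (`retrieve_top_k`, `attend`), the toy 3-dimensional embeddings, the value of `k`, and the residual-sum fusion are all assumptions made for illustration; the paper's retrieval-attention module operates inside the multi-modal encoder and is likely more involved.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(query, bank, k):
    """Rank bank embeddings by cosine similarity to the query; return top-k indices and all scores."""
    q = l2_normalize(query)
    scores = [dot(q, l2_normalize(item)) for item in bank]
    ranked = sorted(range(len(bank)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, retrieved):
    """Single-query scaled dot-product attention over the retrieved embeddings."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, r) / scale for r in retrieved])
    context = [
        sum(w * r[d] for w, r in zip(weights, retrieved))
        for d in range(len(query))
    ]
    return context, weights


# Hypothetical embedding bank standing in for pre-training image-text pairs.
bank = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical embedding of the input image and question.
query = [1.0, 0.05, 0.0]

top_idx, scores = retrieve_top_k(query, bank, k=2)
retrieved = [bank[i] for i in top_idx]
context, weights = attend(query, retrieved)

# Assumed fusion: add the attended context to the query as a residual.
fused = [q + c for q, c in zip(query, context)]

print("retrieved indices:", top_idx)
print("attention weights:", weights)
print("fused representation:", fused)
```

In this toy setup the two bank entries closest in direction to the query are selected, and the attention weights decide how much each retrieved entry contributes to the fused vector. A real system would use learned projections for queries, keys and values, and an approximate nearest-neighbour index for retrieval over a large bank.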