Zheng Yuan, Qiao Jin, Chuanqi Tan, Zhengyun Zhao, Hongyi Yuan, Fei Huang, Songfang Huang
Abstract: Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. Specifically, we collect a new biomedical dataset named PMCPM which offers patient-based image-text pairs containing diverse patient situations from PubMed. Then, we pretrain the biomedical multi-modal model to learn visual and textual representation for image-text pairs and align these representations with image-text contrastive objective (ITC). Finally, we propose a retrieval-augmented method to better use the limited data. We propose to retrieve similar image-text pairs based on ITC from pretraining datasets and introduce a novel retrieval-attention module to fuse the representation of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on Med-VQA2019, Med-VQA2021, VQARAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM can enhance biomedical VQA performance compared with previous resources and methods. We will open-source our dataset, codes, and pretrained model.
### What problem does this paper attempt to solve?
This paper addresses **the problem of insufficient data in the Biomedical Visual Question Answering (Biomedical VQA) task**. Compared with general-domain VQA, Biomedical VQA has far less training data, which makes models prone to overfitting and makes it hard for them to learn the domain knowledge needed to answer complex biomedical questions.
#### Main problems and solutions
1. **The problem of insufficient data**:
- In the Biomedical VQA task, annotated image-text pairs are scarce, so the model tends to overfit during fine-tuning and struggles to learn sufficient domain knowledge.
2. **The proposed new method**:
- **RAMM (Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training)**: To overcome the shortage of data, the authors propose a retrieval-augmented multi-modal pretrain-and-finetune paradigm.
- **PMCPM dataset**: The authors construct a new large-scale, high-quality dataset of biomedical image-text pairs, PMCPM, which contains patient-related images and texts from PubMed Central and covers multiple imaging modalities and conditions.
- **Retrieval-augmentation mechanism**: Using representations aligned by the image-text contrastive objective (ITC), the model retrieves similar image-text pairs from the pretraining datasets, and a new retrieval-attention module fuses the retrieved information with the input image and question, making better use of the limited data.
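The retrieval step described above can be illustrated with a minimal sketch. It assumes every image-text pair has already been encoded into a fixed-length, non-zero embedding, and it ranks candidates by cosine similarity. The function names `cosine` and `retrieve_top_k` and the toy vectors are hypothetical and chosen for illustration only; the paper's actual scoring and indexing may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_top_k(query_emb, corpus_embs, k=2):
    """Return the indices of the k corpus embeddings most similar to the query."""
    scored = [(cosine(query_emb, emb), idx) for idx, emb in enumerate(corpus_embs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy embeddings, made up for illustration (not taken from any dataset).
query = [1.0, 0.0, 0.0]
corpus = [
    [0.9, 0.1, 0.0],   # nearly parallel to the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.7, 0.7, 0.0],   # partially aligned with the query
    [-1.0, 0.0, 0.0],  # opposite to the query
]
top = retrieve_top_k(query, corpus, k=2)
print(top)
```

With a pretraining corpus of realistic size, an exhaustive scan like this would normally be replaced by an approximate nearest-neighbor index; the ranking principle stays the same.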
#### Specific steps
1. **Construct the PMCPM dataset**:
- High-quality patient-related image-text pairs are filtered from PubMed Central to build a large-scale dataset named PMCPM, which is larger and more diverse than the existing ROCO and MIMIC-CXR datasets.
2. **Multi-modal pre-training**:
- The model is pretrained on the PMCPM, ROCO and MIMIC-CXR datasets with masked language modeling (MLM), image-text contrastive learning (ITC) and image-text matching (ITM) objectives, in order to learn better visual and textual representations.
3. **Retrieval-augmented fine-tuning**:
- In the fine-tuning stage, relevant image-text pairs are retrieved by ITC similarity, and the retrieved information is fused into the multi-modal encoder through the new retrieval-attention module, which improves the performance of the model.
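To make the fusion idea in step 3 concrete, the sketch below uses plain scaled dot-product attention in which the input tokens attend over themselves plus the retrieved tokens. This is a generic stand-in and not the paper's retrieval-attention module: it has no learned projections, and the name `fuse_with_retrieved` and the toy token vectors are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query becomes a weighted mix of the value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def fuse_with_retrieved(input_tokens, retrieved_tokens):
    """Let input tokens attend over themselves plus the retrieved tokens (hypothetical helper)."""
    memory = input_tokens + retrieved_tokens
    return attend(input_tokens, memory, memory)

# Toy token vectors, made up for illustration.
input_tokens = [[1.0, 0.0], [0.0, 1.0]]   # stands in for image/question tokens
retrieved_tokens = [[0.5, 0.5]]           # stands in for tokens from a retrieved pair
fused = fuse_with_retrieved(input_tokens, retrieved_tokens)
print(fused)
```

The output has one fused vector per input token, so the retrieved pairs add context without changing the sequence length that later layers see.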
#### Experimental results
- RAMM achieves state-of-the-art performance on multiple biomedical VQA datasets (Med-VQA2019, Med-VQA2021, VQARAD and SLAKE).
- Ablation experiments show that the PMCPM dataset and the retrieval-augmentation mechanism both improve performance on the biomedical VQA task.
### Summary
This paper addresses the shortage of data in the Biomedical VQA task by constructing the large-scale PMCPM dataset and introducing a retrieval-augmentation mechanism, and it reports state-of-the-art results on multiple benchmark datasets.