Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering

Linqin Cai, Haodu Fang, Zhiqing Li
DOI: https://doi.org/10.1007/s11227-023-05195-2
IF: 3.3
2023-03-29
The Journal of Supercomputing
Abstract: Current Medical Image Visual Question Answering (Med-VQA) models often exploit language bias instead of learning multimodal features from both vision and language, and they suffer from sparse data and poor performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment the vision data, we first incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms to extract medical image features effectively. The proposed VB-MVQA model then applies multiple stacked attention layers and a Bilinear Attention Network (BAN) to fuse the extracted image features with the question features extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network. On this basis, the model introduces vision-conditioned reasoning to guide importance selection over the multimodal fused features and to further enhance the image semantic information, with the aim of eliminating language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by an average of 11.08%, 5.28%, and 8.30%, respectively. The proposed method also achieves notably higher accuracy than the baseline models on open-ended questions and is more effective on language-biased Med-VQA datasets.
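The fusion step mentioned in the abstract can be illustrated with a minimal sketch of single-glimpse, low-rank bilinear attention in the spirit of BAN. This is not the authors' implementation: the function names, the shapes, and the choice to reuse one pair of projections for both attention and fusion are assumptions made for brevity, and the nonlinearities, the stacked attention layers, the CLIP and Bi-LSTM encoders, and the vision-conditioned reasoning module are all omitted.

```python
import math

def project(W, x):
    """Return W^T x for a d x k matrix W (list of rows) and a length-d vector x."""
    k = len(W[0])
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(k)]

def bilinear_attention_fuse(img_feats, q_feats, U, V, p):
    """Fuse image-region and question-token features with bilinear attention.

    img_feats: n image-region vectors (dim d_v), e.g. from a visual encoder.
    q_feats:   m question-token vectors (dim d_q), e.g. from a Bi-LSTM.
    U (d_v x k), V (d_q x k): hypothetical low-rank projection matrices.
    p: length-k vector weighting each joint dimension in the attention logits.
    Returns (attention map of shape n x m, fused vector of length k).
    """
    xs = [project(U, x) for x in img_feats]   # n projected image vectors
    ys = [project(V, y) for y in q_feats]     # m projected question vectors
    k = len(p)

    # Bilinear logit for every (region, token) pair.
    logits = [[sum(p[c] * xi[c] * yj[c] for c in range(k)) for yj in ys]
              for xi in xs]

    # Softmax over all n*m pairs (shifted by the max for numerical stability).
    top = max(max(row) for row in logits)
    exps = [[math.exp(v - top) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    attn = [[e / total for e in row] for row in exps]

    # Attention-weighted bilinear pooling into one joint feature vector.
    fused = [sum(attn[i][j] * xs[i][c] * ys[j][c]
                 for i in range(len(xs)) for j in range(len(ys)))
             for c in range(k)]
    return attn, fused
```

In a full model, the fused vector would typically be passed on to later layers and a classifier over candidate answers; how VB-MVQA does this in detail is not stated in the abstract.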
computer science, theory & methods; engineering, electrical & electronic; hardware & architecture