Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

Cuong Nhat Ha, Shima Asaadi, Sanjeev Kumar Karn, Oladimeji Farri, Tobias Heimann, Thomas Runkler
2024-04-25
Abstract: Vision-language models, while effective in general domains and showing strong performance in diverse multi-modal applications like visual question-answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains, e.g., medical. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. This model goes through three stages of parameter-efficient training using three separate biomedical and radiology multi-modal visual and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the weak performance of existing vision-language models (VLMs) on medical visual question answering (MedVQA). General-domain VLMs often lack an in-depth understanding of medical terminology and image context, so they perform worse on medical VQA than on general-domain VQA. The paper therefore proposes a medical vision-language model that fuses domain-adapted vision and language models to improve MedVQA performance.

### Specific problems solved by the paper:

1. **Insufficient adaptation to the medical domain**: Existing VLMs perform poorly on medical VQA because they lack a deep understanding of medical terminology and of the context of medical images.
2. **Free-text generation vs. classification**: Previous MedVQA methods usually treated the task as classification, that is, selecting the correct answer from a predefined answer set. This limits the model's ability to generate free-form answers and may lead to inaccurate evaluation.
3. **Multi-stage training strategy**: Existing methods usually fine-tune general-domain VLMs directly on downstream tasks without fully using domain-specific data for pre-training, which may limit the performance gain.

### Solutions:

1. **Domain-adapted vision and language models**: The paper proposes a medical vision-language model that combines a large language model (LLM) customized for radiology with a biomedical vision model.
2. **Parameter-efficient three-stage training**: Training is divided into three stages:
   - **First stage**: Align medical concepts through an image-caption prediction task on the PMC-OA dataset.
   - **Second stage**: Adapt the model to general medical visual question answering using the PMC-VQA dataset.
   - **Third stage**: Fine-tune on the downstream VQA-RAD and SLAKE 1.0 datasets.
3. **Low-Rank Adaptation (LoRA)**: LoRA is applied to the pre-trained language model so that fine-tuning is parameter-efficient and training remains stable and consistent.

### Experimental results:

- **SLAKE 1.0 dataset**: The proposed model reaches an overall accuracy of 87.5%, a state-of-the-art result on this dataset.
- **VQA-RAD dataset**: The model also performs strongly, with an overall accuracy of 73.2%.
- **Ablation experiment**: The multi-stage training strategy increases accuracy by approximately 25% compared to directly fine-tuning general-domain VLMs, supporting the effectiveness of the approach.

In conclusion, the paper addresses the insufficient performance of existing VLMs on medical VQA by proposing a medical vision-language model together with a multi-stage training strategy.
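
The LoRA step listed under Solutions can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, and the layer size, rank `rank`, scaling `alpha`, and the class name `LoRALinear` are hypothetical values chosen for illustration. It shows the general LoRA idea, in which a frozen weight matrix `W` is augmented by a trainable low-rank product `B A`, with `B` initialized to zero so the adapted layer starts out identical to the pre-trained one.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Toy linear layer with a LoRA adapter (illustrative only)."""

    def __init__(self, d_out, d_in, rank, alpha, seed=0):
        rng = random.Random(seed)
        # Frozen "pre-trained" weight; random values stand in for real weights.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors: A is small random, B starts at zero,
        # so the low-rank update B @ A is zero before any training.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        delta = matvec(self.B, matvec(self.A, x))   # adapter path: B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_out=64, d_in=64, rank=4, alpha=8)
x = [1.0] * 64

# Before training, the adapter contributes nothing to the output.
print(layer.forward(x) == matvec(layer.W, x))          # True
# Only the low-rank factors would be updated during fine-tuning.
print(layer.trainable_params(), layer.frozen_params())  # 512 4096
```

In this toy setting the adapter has 512 trainable parameters against 4096 frozen ones; the ratio shrinks further as the layer dimensions grow relative to the rank, which is what makes this style of fine-tuning parameter-efficient.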