Abstract: Vision-language models, while effective in general domains and showing strong performance in diverse multi-modal applications like visual question-answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains, e.g., medical. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. This model goes through three stages of parameter-efficient training using three separate biomedical and radiology multi-modal visual and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.

What problem does this paper attempt to address?

The paper addresses the weak performance of existing vision-language models (VLMs) on medical visual question answering (MedVQA). These models often lack an in-depth understanding of medical terminology and image context, so they perform worse on medical VQA than on general-domain tasks. The paper therefore proposes a new medical vision-language model that aims to improve MedVQA performance by fusing domain-adapted vision and language models.

### Specific problems solved by the paper:

1. **Insufficient adaptability in the medical field**: Existing VLMs perform poorly on medical VQA because they lack a deep understanding of medical terminology and image context.
2. **Free-text generation vs. classification tasks**: Previous MedVQA methods usually treated the task as classification, that is, selecting the correct answer from a predefined answer set. This limits the model's ability to generate free-form answers and may lead to inaccurate evaluation.
3. **Multi-stage training strategy**: Existing methods usually fine-tune general-domain VLMs directly on downstream tasks without first pre-training on domain-specific data, which may limit the performance gain.

### Solutions:

1. **Domain-adapted vision and language models**: The paper proposes a new medical vision-language model that combines a large-scale language model (LLM) customized for radiology with a biomedical vision model.
2. **Parameter-efficient three-stage training**: Training is divided into three stages:
   - **First stage**: Align medical concepts through an image-caption prediction task, using the PMC-OA dataset.
   - **Second stage**: Adapt to general medical visual question answering using the PMC-VQA dataset.
   - **Third stage**: Fine-tune on the downstream VQA-RAD and SLAKE 1.0 datasets.
3. **Low-Rank Adaptation (LoRA) technique**: Apply LoRA to the pre-trained language model so that it is fine-tuned in a parameter-efficient manner, ensuring the stability and consistency of the model.

### Experimental results:

- **SLAKE 1.0 dataset**: The proposed model achieved an overall accuracy of 87.5% on SLAKE 1.0, significantly outperforming existing methods.
- **VQA-RAD dataset**: The model also performed well on VQA-RAD, with an overall accuracy of 73.2%.
- **Ablation experiment**: The multi-stage training strategy increased accuracy by approximately 25% compared to directly fine-tuning general-domain VLMs, verifying the effectiveness of this method.

In conclusion, the paper addresses the insufficient performance of existing VLMs on medical VQA by proposing a new medical vision-language model and a multi-stage training strategy.
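The LoRA step named in the solutions above can be illustrated with a minimal sketch. It is not the paper's implementation: the class name `LoRALinear`, the rank `r`, the scaling `alpha`, and the toy 4x4 weight are all assumptions chosen for illustration. The idea shown is the standard LoRA formulation, in which a frozen pre-trained weight `W` is augmented with a trainable low-rank update `(alpha / r) * B @ A`, with `B` initialized to zero so the adapted layer starts out identical to the original one.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical layer: frozen weight W plus low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weight, shape d_out x d_in
        # A: r x d_in, small random init (trainable)
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        # B: d_out x r, zero init (trainable), so the update starts at zero
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path: B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A and B would receive gradient updates; W stays fixed.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4x4 "pre-trained" weight (illustrative values only).
W = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 3.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 0.0, 0.5],
]
layer = LoRALinear(W, r=1, alpha=2.0)
x = [1.0, 2.0, 3.0, 4.0]
y_initial = layer.forward(x)           # equals W x because B is all zeros
n_trainable = layer.trainable_params()  # 8, versus 16 entries in W
```

With rank 1 the adapter trains 8 parameters instead of the 16 in `W`; for the much larger matrices in an LLM the ratio is far smaller, which is what makes the approach parameter-efficient.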

Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

Vision–Language Model for Visual Question Answering in Medical Imagery

Multimodal fusion: advancing medical visual question-answering

Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering

The frequency of antlered female and antlerless male roe deer (Capreolus capreolus) in a population in south-east Norway

Medical visual question answering with symmetric interaction attention and cross-modal gating

Visual Question Answering in the Medical Domain

Medical Vision-Language Pre-Training for Brain Abnormalities

Parallel multi-head attention and term-weighted question embedding for medical visual question answering

Question-guided Feature Pyramid Network for Medical Visual Question Answering

Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models

Beyond the Hype: A dispassionate look at vision-language models in medical scenario

MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering

Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

A Question-Centric Model for Visual Question Answering in Medical Imaging

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective

Interpretable medical image Visual Question Answering via multi-modal relationship graph learning