Abstract: The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we developed MedDr, a generalist foundation model for healthcare capable of handling diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, during inference, we propose a simple but effective retrieval-augmented medical diagnosis strategy, which enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.

What problem does this paper attempt to address?

The main problem this paper addresses is the scarcity of training data for large vision-language models (LVLMs) in medicine. Although large-scale vision-language models have demonstrated strong capabilities across many tasks, their development in the medical domain has been severely limited by the lack of extensive, high-quality image-text data. To overcome this, the authors propose a Diagnosis-Guided Bootstrapping strategy, which uses both image and label information to construct a vision-language dataset. Based on the constructed dataset, they developed MedDr, a generalist foundation model capable of handling medical data modalities such as radiology, pathology, dermatology, retinography, and endoscopy. The authors also propose a simple Retrieval-Augmented Medical Diagnosis strategy, applied at inference time, to enhance the model's generalization ability.

### Main Contributions:

1. **Diagnosis-Guided Bootstrapping Strategy**: A data generation method that produces medical reports by combining image and label information, so that the generated text stays consistent with the known diagnosis.
2. **MedDr Model**: A generalist medical foundation model that handles multiple medical data modalities and is reported to achieve state-of-the-art performance on multiple downstream tasks.
3. **Retrieval-Augmented Medical Diagnosis**: A retrieval-based inference strategy that improves both the model's prediction accuracy and its generalization ability.

### Problems Addressed:

- **Data Scarcity**: The Diagnosis-Guided Bootstrapping strategy generates additional training data from existing high-quality medical image classification datasets.
- **Lack of Generalization Capability**: The retrieval-augmented strategy improves the model's diagnostic accuracy on rare or unseen diseases.

### Experimental Results:

- MedDr outperforms other existing models on visual question answering, medical report generation, and medical image diagnosis.
- The retrieval-augmented strategy further improves performance, especially on rare diseases.

In summary, the paper improves the performance of large medical vision-language models through a new data generation method (Diagnosis-Guided Bootstrapping) and an inference-time retrieval strategy.
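The retrieval-augmented diagnosis idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how retrieval is performed, so this example assumes that each image is represented by an embedding vector, that similar labeled images are found by cosine similarity, and that their diagnoses are appended to the prompt as hints. The names `retrieve_labels`, `build_prompt`, and the toy reference bank are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_labels(query, bank, k=3):
    """Return the labels of the k reference images most similar to the query.

    `bank` is a list of (embedding, label) pairs (an assumed data layout).
    """
    ranked = sorted(bank, key=lambda item: cosine(query, item[0]), reverse=True)
    return [label for _, label in ranked[:k]]


def build_prompt(question, retrieved_labels):
    """Append the retrieved diagnoses, with vote counts, to the question."""
    counts = Counter(retrieved_labels)
    hints = ", ".join(f"{label} ({n})" for label, n in counts.most_common())
    return f"{question}\nDiagnoses of similar reference images: {hints}"


# Toy reference bank with 2-D embeddings (illustrative values only).
bank = [
    ([1.0, 0.0], "melanoma"),
    ([0.9, 0.1], "melanoma"),
    ([0.0, 1.0], "nevus"),
    ([0.1, 0.9], "nevus"),
]
query = [1.0, 0.05]

labels = retrieve_labels(query, bank, k=3)
prompt = build_prompt("What is the diagnosis for this image?", labels)
print(prompt)
```

In this sketch the retrieved diagnoses only inform the prompt; the final answer would still come from the vision-language model, which is one plausible way such hints could help on rare classes that the model saw few examples of during training.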

MedDr: Diagnosis-Guided Bootstrapping for Large-Scale Medical Vision-Language Learning

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

Medical Vision-Language Pre-Training for Brain Abnormalities

E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Medical Diagnosis with Large Scale Multimodal Transformers: Leveraging Diverse Data for More Accurate Diagnosis

DeViDe: Faceted medical knowledge for improved medical vision-language pre-training

MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue

MedGo: A Chinese Medical Large Language Model

A Refer-and-Ground Multimodal Large Language Model for Biomedicine

Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical

Advancing High Resolution Vision-Language Models in Biomedicine

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day