HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang

2024-09-30

Abstract: The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the weak medical multimodal capabilities of existing multimodal large language models (MLLMs) such as GPT-4V, particularly when handling medical visual knowledge. It attributes the weakness to the following:

1. **Limitations in data volume and quality**: Existing medical vision-text datasets are limited in both quantity and quality, mainly because of data privacy concerns and high annotation costs.
2. **Data noise**: Some methods use the large-scale, de-identified medical image-text pairs in PubMed to increase the amount of data, but these pairs are noisy, which hurts model performance.
3. **Neglect of visual information**: Earlier methods used "blind" large language models (LLMs) to generate visual question answering (VQA) data. These models cannot perceive the image input, so the text they generate may be inaccurate or irrelevant to the image.

To overcome these problems, the paper proposes the following:

1. **Data screening and reformatting**: Refine medical image-text pairs from PubMed, then use an "unblinded" MLLM (GPT-4V), which receives the image as well as the text, to denoise and reformat the data.
2. **Construct a large-scale, high-quality dataset**: The resulting dataset, PubMedVision, contains 1.3 million medical VQA samples.
3. **Train a medical multimodal model**: Use PubMedVision to train HuatuoGPT-Vision, a 34B-parameter medical MLLM, which shows superior performance among open-source MLLMs in medical multimodal scenarios.

Through these methods, the paper aims to significantly improve the performance of existing MLLMs on medical multimodal tasks.
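The difference between "blind" and "unblinded" generation can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `ImageTextPair` type, the word-count screening rule, the instruction text, and the request layout are all hypothetical, and no real model API is called.

```python
from dataclasses import dataclass


@dataclass
class ImageTextPair:
    image_path: str  # path to a figure extracted from a paper
    caption: str     # the caption or surrounding text for that figure


def screen_pairs(pairs, min_caption_words=5):
    """Keep pairs whose caption has at least `min_caption_words` words.

    Hypothetical screening rule, used here only to show where filtering
    sits in the flow; the paper's actual criteria are not reproduced.
    """
    return [p for p in pairs if len(p.caption.split()) >= min_caption_words]


def build_request(pair, unblinded=True):
    """Assemble the inputs a generator model would receive for one pair.

    A "blind" request carries only the text. An "unblinded" request also
    carries the image, so the generated VQA can be grounded in it.
    """
    request = {
        "instruction": "Write a question and answer about this medical image.",
        "context": pair.caption,
    }
    if unblinded:
        request["image"] = pair.image_path
    return request


pairs = [
    ImageTextPair("a.jpg", "Axial CT of the chest showing a nodule"),
    ImageTextPair("b.jpg", "Figure 2"),
]
kept = screen_pairs(pairs)
blind_request = build_request(kept[0], unblinded=False)
unblinded_request = build_request(kept[0], unblinded=True)
```

Here the second pair is dropped because its caption is too short to be useful, and only `unblinded_request` contains an `image` entry. In a real setup the request would be sent to an MLLM that accepts image input.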

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation

PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging

OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue

Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging

Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation

Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

Advancing High Resolution Vision-Language Models in Biomedicine

LViT: Language meets Vision Transformer in Medical Image Segmentation

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Cross-Modal Self-Supervised Vision Language Pre-training with Multiple Objectives for Medical Visual Question Answering