STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao

2024-10-25

Abstract: Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance with the utilization of only 9% of the medical data.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses the scarcity of high-quality data for medical visual question answering. Training large vision-language models (LVLMs) to assist in medical diagnosis requires large amounts of high-quality visual instruction data, which is expensive and time-consuming to build, especially in the medical domain. To alleviate this data scarcity, the authors propose the Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med). The method trains a policy model (an LVLM) that automatically generates medical visual instruction data, thereby improving data efficiency. Training is guided by Direct Preference Optimization (DPO): a larger and more powerful LVLM (e.g., GPT-4o) acts as a biomedical expert that oversees DPO fine-tuning on the auto-generated data, encouraging the policy model to align efficiently with human preferences. Evaluated on three major medical visual question answering (VQA) benchmarks, STLLaVA-Med achieves competitive zero-shot performance while using only 9% of the medical data. This indicates that STLLaVA-Med can improve the applicability of LVLMs in the medical domain while substantially reducing the amount of medical data required.
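As a rough illustration of the preference step described above, the sketch below computes the standard DPO loss for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, its argument names, and the value `beta=0.1` are assumptions made for illustration, and the inputs are assumed to be summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") answer under the policy model and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All arguments are sequence log-probabilities; `beta` is an assumed
    temperature, not a value reported in the text above.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy leans toward
    # the chosen answer more strongly than the reference model.
    x = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(0.0, 0.0, 0.0, 0.0))
    # Policy favors the chosen answer more than the reference: lower loss.
    print(dpo_loss(-10.0, -15.0, -12.0, -13.0))
```

In a setup like the one summarized here, the chosen and rejected labels would come from the expert model's judgment of the policy's own generated answers; how the paper builds those pairs is not specified in this text.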

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Beyond the Hype: A dispassionate look at vision-language models in medical scenario

R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Image to Label to Answer: An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery

Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

Visual Question Answering in the Medical Domain

LaPA: Latent Prompt Assist Model For Medical Visual Question Answering

LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning

Interpretable medical image Visual Question Answering via multi-modal relationship graph learning

MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

Training Medical Large Vision-Language Models with Abnormal-Aware Feedback

LLaVA-Endo: a Large Language-and-vision Assistant for Gastrointestinal Endoscopy