STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao
2024-10-25
Abstract: Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance with the utilization of only 9% of the medical data.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the scarcity of high-quality data for medical visual question answering. Training large vision-language models (LVLMs) to assist in medical diagnosis requires large amounts of biomedical visual instruction data, but building such data is expensive and labor-intensive, especially in the medical domain. To alleviate this data scarcity, the authors propose the Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med). The method trains a policy model (an LVLM) that can automatically generate medical visual instruction data, thereby improving data efficiency. Training is guided by Direct Preference Optimization (DPO): a larger and more powerful LVLM (e.g., GPT-4o) acts as a biomedical expert that oversees DPO fine-tuning on the auto-generated data, encouraging the policy model to align efficiently with human preferences. Evaluated on three major medical visual question answering (VQA) benchmarks, STLLaVA-Med achieves competitive zero-shot performance while using only 9% of the medical data. This suggests that STLLaVA-Med can improve the applicability of LVLMs in the medical domain while substantially reducing the amount of medical data required.
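To make the DPO step more concrete, here is a minimal sketch of the standard per-example DPO loss in plain Python. It is not taken from the paper's code: the function name `dpo_loss`, its arguments, and the default `beta=0.1` are illustrative assumptions. The sketch also assumes that each auto-generated question comes with one preferred and one rejected answer, and that the summed log-probabilities of each answer under the policy model and a frozen reference model are already available.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    All four inputs are sequence log-probabilities (hypothetical values here).
    beta scales how strongly the policy is pushed away from the reference;
    0.1 is an assumed default, not a value reported in the paper.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # Loss is -log(sigmoid(margin)), computed as a numerically stable
    # softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy already favors the preferred answer: the loss drops below log(2).
improved = dpo_loss(-1.0, -9.0, -5.0, -5.0)
```

Minimizing this loss raises the policy's relative likelihood of the preferred answer over the rejected one, which is the sense in which DPO aligns the policy model with preferences without training a separate reward model.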