Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding

Shenghuan Sun, Alexander Schubert, Gregory M. Goldgof, Zhiqing Sun, Thomas Hartvigsen, Atul J. Butte, Ahmed Alaa
2024-10-10
Abstract: Vision-Language Models (VLMs) can support clinicians by analyzing medical images and engaging in natural language interactions to assist in diagnostic and treatment tasks. However, VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information. This challenge is particularly pronounced in the medical domain, where we do not only require VLM outputs to be accurate in single interactions but also to be consistent with clinical reasoning and diagnostic pathways throughout multi-turn conversations. For this purpose, we propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge. These representations are utilized to (i) generate GPT-4-guided visual instruction tuning data at scale, simulating clinician-VLM conversations with demonstrations of clinical reasoning, and (ii) create an automatic reward function that evaluates the clinical validity of VLM generations throughout clinician-VLM interactions. Our algorithm eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback (RLHF). We apply our alignment algorithm to develop Dr-LLaVA, a conversational VLM finetuned for analyzing bone marrow pathology slides, demonstrating strong performance in multi-turn medical conversations.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses hallucination in vision-language models (VLMs) used for medical image analysis: the text the model generates does not match the visual input or the clinical reasoning path. The inconsistency is most visible in multi-turn conversations, where it undermines the reliability and accuracy of the model in real medical applications. Specifically, the paper points out:

1. **Single-turn accuracy is not enough**: Existing VLMs can reach a reasonable level of accuracy in single-turn conversations, but across multiple turns their responses often lack coherence and consistency, so the conversation drifts away from the diagnostic pathway and from sound clinical reasoning.
2. **Problems in multi-turn conversations**: In multi-turn conversations, a VLM must stay consistent with the clinical reasoning path. Existing alignment methods such as reinforcement learning with human feedback (RLHF) are hard to apply at scale in medicine because they require many expert clinicians, which makes them costly.
3. **Limitations of existing methods**: Existing RLHF methods rely on manually annotated datasets, which are hard to obtain in the medical field, especially for multimodal, multi-turn conversations. In addition, existing VLMs mainly target single-turn question answering and offer little support for complex multi-turn conversations.

To solve these problems, the paper proposes a new alignment algorithm that grounds VLMs in symbolic representations of clinical reasoning, so that their output is accurate in single turns and also stays consistent with the clinical reasoning path throughout a multi-turn conversation.
Specific methods include:

- **Synthesizing multi-turn conversation datasets**: Symbolic clinical reasoning rules are used to automatically generate large-scale multi-turn conversation datasets that simulate interactions between clinicians and VLMs.
- **Automatic reward function**: An automatic reward function evaluates VLM responses and checks that the output follows the clinical reasoning path in both single-turn and multi-turn conversations.
- **Reinforcement learning fine-tuning**: The datasets and reward function above are used to fine-tune the VLM with reinforcement learning, improving its accuracy and consistency in multi-turn conversations.

With these methods, the paper develops Dr-LLaVA, a VLM for analyzing bone marrow pathology slides, and reports performance superior to existing models in both single-turn and multi-turn conversations.
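
To make the idea of a rule-based reward more concrete, here is a minimal Python sketch. It is not the paper's implementation: the rule set `VALID_PATHS`, the placeholder answer labels, the function names, and the weighting scheme are all assumptions made for illustration. It only shows how a reward could combine per-turn correctness with a check that the whole sequence of answers forms a valid reasoning path.

```python
from typing import FrozenSet, Sequence, Tuple

# Hypothetical symbolic rule set: each tuple is one valid sequence of answers
# to an ordered list of questions. The labels are placeholders, not the
# categories used in the paper.
VALID_PATHS: FrozenSet[Tuple[str, ...]] = frozenset({
    ("adequate", "abnormal", "disease"),
    ("adequate", "normal", "no_disease"),
    ("inadequate", "inconclusive", "inconclusive"),
})


def is_consistent(answers: Sequence[str]) -> bool:
    """Return True if the full answer sequence is one of the valid paths."""
    return tuple(answers) in VALID_PATHS


def conversation_reward(
    predicted: Sequence[str],
    reference: Sequence[str],
    correctness_weight: float = 1.0,  # assumed weight, not from the paper
    consistency_bonus: float = 1.0,   # assumed weight, not from the paper
) -> float:
    """Score one multi-turn conversation.

    The score is the fraction of turns answered correctly (scaled by
    correctness_weight) plus a fixed bonus when the predicted answers,
    taken together, follow a valid reasoning path.
    """
    if not reference or len(predicted) != len(reference):
        raise ValueError("predicted and reference must be non-empty and equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    per_turn = correct / len(reference)
    bonus = consistency_bonus if is_consistent(predicted) else 0.0
    return correctness_weight * per_turn + bonus


if __name__ == "__main__":
    reference = ("adequate", "abnormal", "disease")
    # Fully correct and consistent conversation.
    print(conversation_reward(reference, reference))
    # Second answer contradicts the final diagnosis: no consistency bonus.
    print(conversation_reward(("adequate", "normal", "disease"), reference))
```

Because no human rater is involved, a function of this kind can score every generated conversation during reinforcement learning. The relative size of the two weights decides how strongly a self-contradictory conversation is penalized compared with an individually wrong answer, and would need tuning in practice.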