Abstract: Vision-Language Models (VLMs) can support clinicians by analyzing medical images and engaging in natural language interactions to assist in diagnostic and treatment tasks. However, VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information. This challenge is particularly pronounced in the medical domain, where we require VLM outputs not only to be accurate in single interactions but also to be consistent with clinical reasoning and diagnostic pathways throughout multi-turn conversations. For this purpose, we propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge. These representations are utilized to (i) generate GPT-4-guided visual instruction tuning data at scale, simulating clinician-VLM conversations with demonstrations of clinical reasoning, and (ii) create an automatic reward function that evaluates the clinical validity of VLM generations throughout clinician-VLM interactions. Our algorithm eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback (RLHF). We apply our alignment algorithm to develop Dr-LLaVA, a conversational VLM finetuned for analyzing bone marrow pathology slides, demonstrating strong performance in multi-turn medical conversations.

What problem does this paper attempt to address?

The paper addresses "hallucination" by vision-language models (VLMs) in medical image analysis: the text a model generates does not match the visual input or the clinical reasoning path. The inconsistency is most visible in multi-turn conversations, where it undermines the reliability and accuracy of the model in real medical applications. Specifically, the paper points out:

1. **Single-turn accuracy versus multi-turn coherence**: Existing VLMs can be reasonably accurate in a single turn, but across multiple turns their responses often lack coherence and consistency, so the conversation drifts away from the diagnostic pathway and from sound clinical reasoning.
2. **Cost of aligning multi-turn behavior**: In multi-turn conversations, VLMs must stay consistent with the clinical reasoning path. Existing methods such as reinforcement learning with human feedback (RLHF) are hard to apply at scale in the medical field because they require many professional doctors, which makes them costly and difficult to scale.
3. **Limitations of existing methods**: Existing RLHF methods rely on manually annotated datasets, which are difficult to obtain in the medical field, especially for multimodal, multi-turn conversations. In addition, existing VLMs mainly target single-turn question answering and lack support for complex multi-turn conversations.

To solve these problems, the paper proposes a new alignment algorithm that guides VLMs with symbolic representations of clinical reasoning, so that their output is not only accurate in single turns but also consistent with the clinical reasoning path throughout a multi-turn conversation.
Specific methods include:

- **Synthesizing multi-turn conversation datasets**: Symbolically represented clinical reasoning rules are used to automatically generate large-scale multi-turn conversation datasets that simulate the interaction between doctors and VLMs.
- **Automatic reward function**: An automatic reward function evaluates VLM responses to check that the output follows the clinical reasoning path in both single-turn and multi-turn conversations.
- **Reinforcement learning fine-tuning**: The datasets and reward function above are used to fine-tune VLMs with reinforcement learning, improving their accuracy and consistency in multi-turn conversations.

Using these methods, the authors develop Dr-LLaVA, a VLM specialized for analyzing bone marrow pathology slides, which is reported to outperform existing models in both single-turn and multi-turn conversations.
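
The first two methods can be illustrated with a minimal Python sketch. It is not the paper's implementation: the questions, answer labels, `VALID_PATHS`, `build_conversation`, `conversation_reward`, and the bonus weight are all hypothetical names and values chosen for illustration. The sketch assumes the symbolic rules can be written as a set of valid answer chains, uses plain templates where the paper uses GPT-4 guidance, and scores a conversation by per-turn correctness plus a bonus when the whole chain of answers is consistent with the rules.

```python
from typing import List, Sequence, Tuple

# Hypothetical symbolic rules (illustrative, not taken from the paper):
# each tuple is one valid chain of answers to the ordered questions below.
QUESTIONS: Tuple[str, ...] = (
    "Is the image quality adequate for assessment?",
    "Are abnormal cells visible?",
    "What is the most likely finding?",
)
VALID_PATHS = {
    ("adequate", "abnormal", "disease"),
    ("adequate", "normal", "healthy"),
    ("inadequate", "not assessable", "no diagnosis"),
}


def build_conversation(path: Sequence[str]) -> List[Tuple[str, str]]:
    """Turn one valid reasoning path into (question, answer) turns.

    Fixed templates stand in for the GPT-4-guided generation step here.
    """
    if tuple(path) not in VALID_PATHS:
        raise ValueError("path is not licensed by the symbolic rules")
    return list(zip(QUESTIONS, path))


def conversation_reward(
    predicted: Sequence[str],
    reference: Sequence[str],
    consistency_bonus: float = 1.0,  # assumed weight, chosen arbitrarily
) -> float:
    """Score a multi-turn conversation without human feedback.

    One point per turn that matches the reference answer, plus a bonus
    when the full chain of predicted answers is a valid reasoning path.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    per_turn = sum(1.0 for p, r in zip(predicted, reference) if p == r)
    is_consistent = tuple(predicted) in VALID_PATHS
    return per_turn + (consistency_bonus if is_consistent else 0.0)
```

With these toy rules, a fully correct conversation scores 4.0 (three matching turns plus the bonus). A conversation that gets two turns right but answers "normal" followed by "disease" scores only 2.0, because that chain contradicts the rules. This is the kind of multi-turn inconsistency the reward is meant to penalize.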

Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Biomedical Visual Instruction Tuning with Clinician Preference Alignment

Aligning Large Multimodal Models with Factually Augmented RLHF

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

AI Hospital: Interactive Evaluation and Collaboration of LLMs As Intern Doctors for Clinical Diagnosis

R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest

Advancing High Resolution Vision-Language Models in Biomedicine

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment