Abstract: The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have not adequately accounted for clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods, with average gains of 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively. Our code is available at https://github.com/aiming-lab/MMedPO.
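The local lesion-noising idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `noise_lesion_region`, the rectangular `box` format, the Gaussian noise model, and the `noise_std` value are all assumptions made for illustration, and the image is a plain nested list of grayscale values in [0, 1].

```python
import random


def noise_lesion_region(image, box, noise_std=0.5, seed=0):
    """Return a copy of `image` with Gaussian noise added inside `box` only.

    image: list of rows, each a list of floats in [0, 1] (hypothetical format).
    box:   (top, left, bottom, right), end-exclusive (hypothetical format).
    """
    rng = random.Random(seed)
    top, left, bottom, right = box
    # Copy every row so the caller's image is left untouched.
    noised = [row[:] for row in image]
    for r in range(top, bottom):
        for c in range(left, right):
            value = noised[r][c] + rng.gauss(0.0, noise_std)
            # Clamp back into the valid intensity range.
            noised[r][c] = min(1.0, max(0.0, value))
    return noised


image = [[0.5] * 8 for _ in range(8)]
corrupted = noise_lesion_region(image, box=(2, 2, 6, 6))
```

Pixels outside the box are left unchanged, so only the region assumed to contain the lesion is degraded. A response conditioned on the corrupted image could then serve as the dispreferred sample.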

What problem does this paper attempt to address?

This paper addresses the factuality problems that medical vision-language models (Med-LVLMs) exhibit because of poor modality alignment. Specifically, existing Med-LVLMs tend to prioritize textual knowledge learned during training over the actual visual input when processing medical images. The generated text can therefore contradict the information in the corresponding medical image, a phenomenon known as "hallucination". In addition, previous methods that enhance modality alignment through preference optimization have not fully accounted for clinical relevance in the preference data, which makes the samples easy to distinguish and reduces the effectiveness of alignment. To address this challenge, the paper proposes MMedPO, a new multimodal medical preference optimization method that enhances Med-LVLM alignment by considering the clinical relevance of preference samples. MMedPO curates multimodal preference data by introducing two types of dispreferred responses: (1) plausible hallucinations, injected through the target Med-LVLM or GPT-4o to produce medically inaccurate responses; and (2) lesion region neglect, achieved by adding noise locally to the lesion region so that visual understanding of critical areas is disrupted. The clinical relevance of each sample is then calculated from the scores of multiple Med-LLMs and visual tools, and these scores are integrated into the preference optimization process as weights. The experimental results show that MMedPO significantly improves the factual accuracy of Med-LVLMs, outperforming existing preference optimization methods by an average of 14.2% on Med-VQA tasks and 51.7% on report generation tasks.
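To make "scores integrated as weights" concrete, the following is a minimal sketch of a per-sample weighted preference loss. It assumes a DPO-style objective, which the summary above does not spell out. The function names, the `beta` value, and the assumption that clinical relevance weights lie in [0, 1] are illustrative rather than taken from the paper.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_preference_loss(policy_chosen, policy_rejected,
                             ref_chosen, ref_rejected, weights, beta=0.1):
    """Mean DPO-style loss with each sample scaled by a relevance weight.

    The first four arguments are lists of sequence log-probabilities under
    the policy and a frozen reference model; `weights` holds the
    (hypothetical) clinical relevance score of each preference pair.
    """
    lists = (policy_chosen, policy_rejected, ref_chosen, ref_rejected, weights)
    if not weights or len({len(x) for x in lists}) != 1:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for pc, pr, rc, rr, w in zip(*lists):
        # How much more the policy prefers the chosen response than the
        # reference model does.
        margin = (pc - rc) - (pr - rr)
        total += -w * log_sigmoid(beta * margin)
    return total / len(weights)


loss = weighted_preference_loss(
    policy_chosen=[-12.0, -9.0], policy_rejected=[-15.0, -9.5],
    ref_chosen=[-13.0, -9.0], ref_rejected=[-14.0, -9.0],
    weights=[0.9, 0.3],
)
```

Under this sketch, a pair with a low relevance weight contributes little to the loss, so clinically irrelevant or trivially distinguishable pairs influence training less than clinically meaningful ones.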

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization
