Abstract: The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) face factuality challenges due to modality misalignment: the models prioritize textual knowledge over visual input, which leads to hallucinations that contradict the information in medical images. Previous attempts to improve modality alignment in Med-LVLMs through preference optimization have not adequately accounted for clinical relevance in the preference data, which makes the samples easy to distinguish and reduces alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations, injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect, achieved through local lesion-noising that disrupts visual understanding of critical areas. We then calculate a clinical relevance score for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, with average improvements over existing preference optimization methods of 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively. Our code is available at https://github.com/aiming-lab/MMedPO.
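The abstract states only that clinical relevance scores enter the preference optimization process as weights. One way this could look, as a minimal sketch and not the authors' implementation, is a per-sample weight applied to the standard DPO loss. The function name `weighted_dpo_loss`, the `beta` default, and the choice of a simple multiplicative weight are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(
    policy_chosen_logp,
    policy_rejected_logp,
    ref_chosen_logp,
    ref_rejected_logp,
    weights,
    beta: float = 0.1,
) -> float:
    """Mean DPO loss with a per-sample weight (hypothetical formulation).

    Each argument except `beta` is a list with one entry per preference pair.
    `weights` stands in for a normalized clinical relevance score; how such a
    score is computed is not specified here.
    """
    total = 0.0
    for pc, pr, rc, rr, w in zip(
        policy_chosen_logp,
        policy_rejected_logp,
        ref_chosen_logp,
        ref_rejected_logp,
        weights,
    ):
        # Standard DPO margin: how much more the policy prefers the chosen
        # response over the rejected one, relative to the reference model.
        margin = beta * ((pc - rc) - (pr - rr))
        # Assumption: the weight simply scales the per-sample loss.
        total += w * -log_sigmoid(margin)
    return total / len(weights)
```

Under this sketch, a pair with a higher weight contributes more to the gradient, so clinically relevant dispreferred samples would influence alignment more than trivially distinguishable ones.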

Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs

Token-level Direct Preference Optimization

VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Multimodal Preference Data Synthetic Alignment with Reward Model

Mitigating Object Hallucination via Concentric Causal Attention

DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer

Prescribing the Right Remedy: Mitigating Hallucinations in Large Vision-Language Models via Targeted Instruction Tuning

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding