Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

Danfeng Guo, Demetri Terzopoulos

2024-07-31

Abstract: Large Vision-Language Models (LVLMs) have achieved significant success in recent years, and they have been extended to the medical domain. Although demonstrating satisfactory performance on medical Visual Question Answering (VQA) tasks, Medical LVLMs (MLVLMs) suffer from the hallucination problem, which makes them fail to diagnose complex pathologies. Moreover, they readily fail to learn minority pathologies due to imbalanced training data. We propose two prompting strategies for MLVLMs that reduce hallucination and improve VQA performance. In the first strategy, we provide a detailed explanation of the queried pathology. In the second strategy, we fine-tune a cheap, weak learner to achieve high performance on a specific metric, and textually provide its judgment to the MLVLM. Tested on the MIMIC-CXR-JPG and CheXpert datasets, our methods significantly improve the diagnostic F1 score, with the highest increase being 0.27. We also demonstrate that our prompting strategies can be extended to general LVLM domains. Based on POPE metrics, they effectively suppress the false negative predictions of existing LVLMs and improve Recall by approximately 0.07.
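The abstract reports gains in F1 score and Recall. As a reminder of what those numbers measure, here is a minimal sketch that computes precision, recall, and F1 from binary yes/no answers using the standard definitions. The function name `binary_metrics` and the toy labels are illustrative assumptions, not taken from the paper or its evaluation code.

```python
from typing import Dict, Sequence


def binary_metrics(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and F1 for binary labels (1 = pathology present)."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (made-up data): one false negative and one false positive.
toy_labels = [1, 1, 0, 0, 1]
toy_preds = [1, 0, 0, 1, 1]
toy_result = binary_metrics(toy_labels, toy_preds)
```

A false negative (a missed pathology) lowers recall, while a false positive lowers precision. F1 is the harmonic mean of the two, so it falls when either kind of error is frequent.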

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper primarily aims to address the hallucination problem in Medical Visual Question Answering (MVQA) tasks involving Medical Large Vision-Language Models (MLVLMs), especially in cases where the models perform poorly in diagnosing complex pathologies. Specifically:

1. **Hallucination Problem**: MLVLMs generate content that is inconsistent with the input medical images, leading to incorrect diagnoses.
2. **Difficulty in Learning Minority Pathologies**: Due to imbalanced training data, the model struggles to learn features of minority pathologies.

To address these problems, the paper proposes two prompting strategies:

- The first strategy provides a detailed explanation of the queried pathology alongside the question, to help the model better understand pathological features.
- The second strategy introduces a low-cost weak learner whose judgment is given as a reference in the prompt, reducing false positive predictions.

The researchers validated these strategies on the MIMIC-CXR-JPG and CheXpert datasets, significantly improving the diagnostic F1 score. Additionally, the strategies can be extended to general LVLMs to enhance their performance.
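The two prompting strategies summarized above can be illustrated with a minimal sketch of how such a prompt might be assembled. This is not the authors' implementation: the function `build_prompt`, its parameters, the explanation text, and the exact wording of the weak learner's hint are all hypothetical, chosen only to show the idea of prepending a pathology explanation and a textual reference judgment to the question.

```python
from typing import Optional


def build_prompt(pathology: str,
                 explanation: Optional[str] = None,
                 weak_learner_says_present: Optional[bool] = None) -> str:
    """Assemble a yes/no diagnostic question for a vision-language model.

    explanation: optional description of the queried pathology (strategy 1).
    weak_learner_says_present: optional binary judgment from a separately
        trained classifier, passed to the model as text (strategy 2).
    """
    parts = []
    if explanation:
        # Strategy 1: describe the pathology before asking about it.
        parts.append(explanation.strip())
    if weak_learner_says_present is not None:
        # Strategy 2: state the weak learner's judgment as a textual hint.
        verdict = "present" if weak_learner_says_present else "absent"
        parts.append(
            f"A reference classifier judges {pathology} to be {verdict}. "
            "Treat this as a hint, not as ground truth."
        )
    parts.append(f"Is there {pathology} in this image? Answer yes or no.")
    return " ".join(parts)


# Hypothetical usage; the explanation wording is illustrative only.
example_prompt = build_prompt(
    "cardiomegaly",
    explanation="Cardiomegaly refers to an enlarged heart.",
    weak_learner_says_present=False,
)
```

Making both additions optional means the same function can produce the plain baseline question, either strategy alone, or both combined, which is convenient for comparing them on the same images.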

Targeted Visual Prompting for Medical Visual Question Answering

Beyond the Hype: A dispassionate look at vision-language models in medical scenario

Hallucination Benchmark in Medical Visual Question Answering

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

LaPA: Latent Prompt Assist Model For Medical Visual Question Answering

MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Towards Reliable Medical Question Answering: Techniques and Challenges in Mitigating Hallucinations in Language Models

Mitigating Multilingual Hallucination in Large Vision-Language Models

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization

Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Aligning Medical Images with General Knowledge from Large Language Models

Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Interpretable medical image Visual Question Answering via multi-modal relationship graph learning

Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)