MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

Zishan Gu, Changchang Yin, Fenglin Liu, Ping Zhang

2024-07-03

Abstract: Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, which has inspired a large number of studies on LVLM fine-tuning and training. Despite these advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs. MedVH comprises five tasks to evaluate hallucinations in LVLMs within the medical context, which include tasks for comprehensive understanding of textual and visual input, as well as long textual response generation. Our extensive experiments with both general and medical LVLMs reveal that, although medical LVLMs demonstrate promising performance on standard medical tasks, they are particularly susceptible to hallucinations, often more so than the general models, raising significant concerns about the reliability of these domain-specific models. For medical LVLMs to be truly valuable in real-world applications, they must not only accurately integrate medical knowledge but also maintain robust reasoning abilities to prevent hallucination. Our work paves the way for future evaluations of these studies.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses hallucination in large vision language models (LVLMs) that have been fine-tuned on small datasets for medical tasks. Although these models perform well on standard medical tasks, they are prone to generating incorrect information when faced with misleading inputs. To study this, the authors introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), which evaluates how well such models resist hallucination in medical settings. MedVH comprises five tasks, ranging from comprehensive understanding of textual and visual input to long textual response generation. Experiments with both general and medical LVLMs indicate that medical LVLMs, despite promising performance on standard tasks, are particularly susceptible to hallucination, often more so than the general models, which raises concerns about their reliability in real-world medical applications. The authors hope the benchmark will support the development of more reliable and trustworthy models and facilitate their practical use in real medical scenarios.
