Abstract: Hallucination has been a major problem for large language models and remains a critical challenge in multimodal settings, where vision-language models (VLMs) have to deal with visual as well as textual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, in addition to real ones, to enable dataset creation at scale. With over 7.7K examples spanning a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results show that benchmarking with generated images is highly correlated (r=0.97) with benchmarking on real images. Last but not least, we propose a novel Auto-Eval mechanism for evaluating VLMs that is highly correlated with human raters (r=0.99). In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.

What problem does this paper attempt to address?

This paper addresses hallucination in vision-language models (VLMs). Hallucination refers to factually incorrect or inconsistent information generated by a model. It is a major problem in large language models and is especially critical in multimodal settings, where VLMs must process both textual and visual inputs. Although VLMs have made rapid progress, resources for evaluating and addressing multimodal hallucination are still limited and mainly focused on evaluation. To this end, the authors introduce HaloQuest, a new visual question answering dataset designed to capture various aspects of multimodal hallucination, such as false premises, insufficient context, and visual challenges. One innovation of HaloQuest is the use of synthetic images (in addition to real images) to enable large-scale dataset creation. The dataset contains more than 7,700 examples covering a wide range of categories, and is designed to serve both as a challenging benchmark for VLMs and as a fine-tuning dataset for improving multimodal reasoning. In their experiments, the authors found that current models perform poorly on HaloQuest, with all open-source VLMs scoring below 36% accuracy. However, fine-tuning on HaloQuest significantly reduces the hallucination rate while maintaining performance on standard reasoning tasks. The authors also report that benchmark results on generated images correlate strongly (r=0.97) with results on real images, and they propose a new automatic evaluation mechanism (Auto-Eval) whose scores correlate strongly with human raters (r=0.99). In summary, by introducing the HaloQuest dataset and its associated evaluation mechanism, the paper contributes to understanding, evaluating, and reducing hallucination in VLMs, and lays groundwork for more reliable and trustworthy multimodal AI systems.
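
The reported correlations (r=0.97 and r=0.99) are the kind of statistic that can be computed from paired scores, for example per-model accuracy as judged by human raters versus an automatic evaluator. The sketch below assumes that r denotes the Pearson correlation coefficient, which the abstract does not state explicitly. The function name `pearson_r` and all score values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model accuracies (made-up values, NOT from the paper):
# one entry per model, scored by human raters and by an automatic evaluator.
human_scores = [0.20, 0.35, 0.50, 0.65]
auto_scores = [0.22, 0.33, 0.52, 0.66]

r = pearson_r(human_scores, auto_scores)
print(f"r = {r:.3f}")
```

A value of r close to 1 means the automatic evaluator orders and spaces the models almost the same way the human raters do, which is the property the abstract claims for Auto-Eval.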

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Visual Hallucination: Definition, Quantification, and Prescriptive Remediations

Hallucination of Multimodal Large Language Models: A Survey

Visual Hallucinations of Multi-modal Large Language Models

Hallucination Benchmark in Medical Visual Question Answering

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

Evaluation and Analysis of Hallucination in Large Vision-Language Models

Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

A Survey on Hallucination in Large Vision-Language Models