HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, Golnaz Ghiasi
2024-07-22
Abstract: Hallucination has been a major problem for large language models and remains a critical challenge in multimodal settings, where vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, in addition to real ones, to enable dataset creation at scale. With over 7.7K examples spanning a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results show that benchmarking with generated images is highly correlated (r=0.97) with benchmarking on real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia
What problem does this paper attempt to address?
This paper addresses hallucination in vision-language models (VLMs). Hallucination refers to factually incorrect or inconsistent information generated by a model. It is a major problem in large language models and is especially critical in multimodal settings, where VLMs must process both textual and visual inputs. Although VLMs have progressed rapidly, resources for evaluating and addressing multimodal hallucination remain limited and are mostly focused on evaluation. To this end, the authors introduce HaloQuest, a new visual question answering dataset designed to capture various aspects of multimodal hallucination, such as false premises, insufficient context, and visual challenges. One innovation of HaloQuest is the use of synthetic images, in addition to real ones, to enable dataset creation at scale. The dataset contains more than 7,700 examples covering a wide range of categories, and it is designed to serve both as a challenging benchmark for VLMs and as a fine-tuning dataset for improving multimodal reasoning. In the authors' experiments, current models perform poorly on HaloQuest, with all open-source VLMs achieving below 36% accuracy. Fine-tuning on HaloQuest, however, significantly reduces hallucination rates while preserving performance on standard reasoning tasks. The authors also report that benchmark results on generated images are highly correlated (r=0.97) with results on real images, and they propose an automatic evaluation mechanism (Auto-Eval) whose scores are highly correlated (r=0.99) with those of human raters. In summary, by introducing the HaloQuest dataset and its associated evaluation mechanism, the paper contributes to understanding, evaluating, and mitigating hallucination in VLMs, and lays groundwork for more reliable multimodal AI systems.
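To make the reported correlations (r=0.97 and r=0.99) concrete, here is a minimal sketch of how a Pearson correlation between two sets of scores could be computed. It is not the authors' evaluation code. The function name `pearson_r` and the score lists are hypothetical placeholders, not values from the paper, and the pairing of one automatic score with one human score per model is an assumption made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical placeholder scores (NOT results from the paper):
    # one accuracy value per model, as judged by humans and by an automatic rater.
    human_scores = [10.0, 25.0, 40.0, 60.0]
    auto_scores = [12.0, 24.0, 43.0, 58.0]
    print(f"Pearson r = {pearson_r(human_scores, auto_scores):.3f}")
```

A value of r close to 1 means the two raters rank and space the models almost identically, which is the sense in which the summary describes Auto-Eval as agreeing with human raters.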