Abstract: Despite significant strides in multimodal tasks, Multimodal Large Language Models (MLLMs) are plagued by the critical issue of hallucination. The reliable detection of such hallucinations in MLLMs has, therefore, become a vital aspect of model evaluation and the safeguarding of practical application deployment. Prior research in this domain has been constrained by a narrow focus on singular tasks, an inadequate range of hallucination categories addressed, and a lack of detailed granularity. In response to these challenges, our work expands the investigative horizons of hallucination detection. We present a novel meta-evaluation benchmark, MHaluBench, meticulously crafted to facilitate the evaluation of advancements in hallucination detection methods. Additionally, we unveil a novel unified multimodal hallucination detection framework, UNIHD, which leverages a suite of auxiliary tools to validate the occurrence of hallucinations robustly. We demonstrate the effectiveness of UNIHD through meticulous evaluation and comprehensive analysis. We also provide strategic insights on the application of specific tools for addressing various categories of hallucinations.

What problem does this paper attempt to address?

This paper addresses the problem of hallucination detection in Multimodal Large Language Models (MLLMs). Although MLLMs have made significant progress on multimodal tasks, they are prone to generating content that contradicts the input data or known facts, a phenomenon called "hallucination". Reliable hallucination detection is crucial for evaluating model performance and ensuring the safety of practical applications.

### Main Problems and Challenges of the Paper

1. **Single-task focus**: Existing research mainly targets specific tasks, such as image captioning, while ignoring other important tasks such as text-to-image generation.
2. **Limited types of hallucination**: Most previous studies cover only object-level hallucination and ignore other types, such as scene-text hallucination and factual inconsistency.
3. **Insufficient granularity**: Existing hallucination detection methods usually evaluate the response as a whole and lack fine-grained analysis of each claim in the response.

### Solutions

To address these problems, the authors propose the following:

1. **MHaluBench Benchmark**: A meta-evaluation benchmark covering multiple hallucination categories and multimodal tasks, intended to evaluate hallucination detection methods more comprehensively.
2. **UNIHD Framework**: A Unified Hallucination Detection (UNIHD) framework that uses a series of auxiliary tools to verify whether a hallucination has occurred. The framework has four main steps:
   - **Core Claim Extraction**: Extract core claims from the generated response or user query.
   - **Autonomous Tool Selection**: Select appropriate tools by automatically generating related questions to verify each claim.
   - **Parallel Tool Execution**: Deploy multiple dedicated tools to run concurrently, providing evidence to reliably verify potential hallucinations.
   - **Hallucination Verification and Explanation**: Summarize the collected evidence, guide the underlying MLLM to determine whether the claim is a hallucination, and provide an explanation.

### Experimental Results

The experimental results show that the UNIHD framework significantly outperforms the baseline methods on multiple evaluation metrics, in both image-to-text and text-to-image generation tasks. Although MHaluBench is a challenging benchmark, UNIHD demonstrates its effectiveness in fine-grained hallucination detection.

In conclusion, by introducing a new benchmark and a new framework, this paper provides a more unified and comprehensive solution for hallucination detection in multimodal large language models.
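The four UNIHD steps described in the Solutions section can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: every name here (`object_tool`, `scene_text_tool`, `select_tools`, `verify`, and so on) is hypothetical, the tools are stubs that return canned results instead of inspecting an image, tool selection uses keyword routing rather than generated questions, and a simple rule stands in for the MLLM's final judgment.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Evidence:
    tool: str
    supports: bool
    note: str


@dataclass
class Verdict:
    claim: str
    hallucinated: bool
    explanation: str


# Hypothetical stub tools; real tools would analyze the actual image.
def object_tool(claim: str) -> Evidence:
    detected = {"dog", "frisbee"}  # pretend object-detector output
    mentioned = set(claim.lower().split()) & {"dog", "cat", "frisbee"}
    missing = sorted(mentioned - detected)
    return Evidence("object", not missing, f"objects not found: {missing}")


def scene_text_tool(claim: str) -> Evidence:
    ocr_text = "STOP"  # pretend OCR output
    ok = ocr_text.lower() in claim.lower().split()
    return Evidence("scene_text", ok, f"OCR read: {ocr_text!r}")


TOOLS: Dict[str, Callable[[str], Evidence]] = {
    "object": object_tool,
    "scene_text": scene_text_tool,
}


def extract_claims(response: str) -> List[str]:
    # Step 1 (toy): treat each sentence as one claim.
    return [s.strip() for s in response.split(".") if s.strip()]


def select_tools(claim: str) -> List[str]:
    # Step 2 (toy): keyword routing stands in for question generation.
    names = ["object"]
    if "sign" in claim.lower().split():
        names.append("scene_text")
    return names


def run_tools(claim: str, names: List[str]) -> List[Evidence]:
    # Step 3: run the selected tools concurrently and collect evidence.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda name: TOOLS[name](claim), names))


def verify(claim: str, evidence: List[Evidence]) -> Verdict:
    # Step 4 (toy): a rule replaces the MLLM judgment and explanation.
    against = [e for e in evidence if not e.supports]
    explanation = "; ".join(f"{e.tool}: {e.note}" for e in against) or "supported"
    return Verdict(claim, bool(against), explanation)


def detect(response: str) -> List[Verdict]:
    # One verdict per claim, rather than one label for the whole response.
    return [verify(c, run_tools(c, select_tools(c))) for c in extract_claims(response)]


verdicts = detect("A dog catches a frisbee. A cat sits nearby. The sign says GO.")
for v in verdicts:
    print(v.hallucinated, "|", v.claim, "|", v.explanation)
```

The point of the sketch is the shape of the output: each claim receives its own verdict and explanation, which is the claim-level granularity the summary contrasts with whole-response evaluation.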

Unified Hallucination Detection for Multimodal Large Language Models
