Abstract: Hallucination, a phenomenon in which multimodal large language models (MLLMs) generate textual responses that are plausible but not aligned with the image, has become a major hurdle in various MLLM-related applications. Several benchmarks have been created to gauge the hallucination levels of MLLMs, either by raising discriminative questions about the existence of objects or by introducing LLM evaluators to score the text generated by MLLMs. However, the discriminative data largely involve simple questions that are not aligned with real-world text, while the generative data rely on LLM evaluators that are computationally intensive and unstable due to their inherent randomness. We propose LongHalQA, an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts. LongHalQA features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios, including object descriptions, image descriptions, and multi-round conversations averaging 14, 130, and 189 words, respectively. It introduces two new tasks, hallucination discrimination and hallucination completion, which unify discriminative and generative evaluations in a single multiple-choice-question form and lead to more reliable and efficient evaluations without the need for LLM evaluators. Further, we propose an advanced pipeline that greatly facilitates the construction of future hallucination benchmarks with long and complex questions and descriptions. Extensive experiments over multiple recent MLLMs reveal various new challenges that arise when they handle hallucinations in long and complex textual data. Dataset and evaluation code are available at https://github.com/hanqiu-hq/LongHalQA.

LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models