Abstract: Hallucination, a phenomenon where multimodal large language models (MLLMs) tend to generate textual responses that are plausible but unaligned with the image, has become a major hurdle in various MLLM-related applications. Several benchmarks have been created to gauge the hallucination levels of MLLMs, either by raising discriminative questions about the existence of objects or by introducing LLM evaluators to score the text generated by MLLMs. However, the discriminative data largely involve simple questions that are not aligned with real-world text, while the generative data involve LLM evaluators that are computationally intensive and unstable due to their inherent randomness. We propose LongHalQA, an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts. LongHalQA features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios, including object descriptions, image descriptions, and multi-round conversations that average 14, 130, and 189 words, respectively. It introduces two new tasks, hallucination discrimination and hallucination completion, unifying both discriminative and generative evaluations in a single multiple-choice-question form and leading to more reliable and efficient evaluations without the need for LLM evaluators. Further, we propose an advanced pipeline that greatly facilitates the construction of future hallucination benchmarks with long and complex questions and descriptions. Extensive experiments over multiple recent MLLMs reveal various new challenges when they handle hallucinations with long and complex textual data. Dataset and evaluation code are available at https://github.com/hanqiu-hq/LongHalQA.

Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

Multi-modal Situated Reasoning in 3D Scenes

Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent

Explore the Hallucination on Low-level Perception for MLLMs

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Situational Awareness Matters in 3D Vision Language Reasoning

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

CuriousLLM: Elevating Multi-Document QA with Reasoning-Infused Knowledge Graph Prompting

Good Questions Help Zero-Shot Image Reasoning

Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

Question guided multimodal receptive field reasoning network for fact-based visual question answering

Multi-Modal Dialogue State Tracking for Playing GuessWhich Game

LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

Multitask Learning for Visual Question Answering