Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, Ji-Rong Wen

2023-10-26

Abstract: Inspired by the superior language abilities of large language models (LLM), large vision-language models (LVLM) have been recently explored by integrating powerful LLMs for improving the performance on complex multimodal tasks. Despite the promising progress on LVLMs, we find that LVLMs suffer from the hallucination problem, i.e. they tend to generate objects that are inconsistent with the target images in the descriptions. To investigate it, this work presents the first systematic study on object hallucination of LVLMs. We conduct the evaluation experiments on several representative LVLMs, and show that they mostly suffer from severe object hallucination issue. We further discuss that the visual instructions may influence the hallucination, and find that: objects that frequently occur in the visual instructions or co-occur with the image objects, are obviously prone to be hallucinated by LVLMs. Besides, we find that existing evaluation methods might be affected by the input instructions and generation styles of LVLMs. Thus, we further design an improved evaluation method for object hallucination by proposing a polling-based query method called POPE. Experiment results demonstrate that our POPE can evaluate the object hallucination in a more stable and flexible way. Our codes and data are publicly available at https://github.com/RUCAIBox/POPE.

Computer Vision and Pattern Recognition, Computation and Language, Multimedia

What problem does this paper attempt to address?

The paper addresses object hallucination in large vision-language models (LVLMs) when they generate image descriptions. Specifically:

1. **Systematic Study of Object Hallucination**: The paper conducts the first systematic study of object hallucination in LVLMs. It finds that existing LVLMs often generate descriptions containing objects that are inconsistent with, or absent from, the target image.
2. **Limitations of Evaluation Methods**: The paper points out flaws in existing evaluation methods (such as CHAIR), whose results are sensitive to the design of the instructions and the length of the generated descriptions. These methods also rely on complex hand-written rules to match generated objects, which can lead to misclassification errors.
3. **Proposing a New Evaluation Method POPE**: To evaluate object hallucination in a more stable and flexible way, the authors propose a polling-based object probing evaluation method (POPE). POPE asks LVLMs simple yes/no questions (e.g., "Is there a chair in the image?"), so the evaluation no longer depends on parsing free-form descriptions.
4. **Analyzing the Causes**: The study also explores why object hallucination occurs, finding that LVLMs tend to hallucinate objects that appear frequently in the visual instruction data or that co-occur with objects already present in the image.

Through these studies, the paper reveals challenges that LVLMs may face in practical applications and proposes a more effective evaluation method to help future researchers understand and improve these models.
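The polling idea in item 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that); the helper names `build_questions` and `evaluate`, the uniform random sampling of absent objects, and the choice of reported metrics are all assumptions made for illustration. The model is represented by a hypothetical `answer_fn` callable that maps a question string to a free-text answer.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Question = Tuple[str, str]  # (question text, expected answer: "yes" or "no")

TEMPLATE = "Is there a {} in the image?"  # question form quoted in the summary


def build_questions(present: Sequence[str], vocabulary: Sequence[str],
                    rng: random.Random) -> List[Question]:
    """One "yes" question per annotated object, plus as many "no" questions
    about objects sampled from those NOT annotated in the image
    (uniform random sampling is an assumption of this sketch)."""
    absent = sorted(set(vocabulary) - set(present))
    negatives = rng.sample(absent, min(len(present), len(absent)))
    positives = [(TEMPLATE.format(obj), "yes") for obj in present]
    return positives + [(TEMPLATE.format(obj), "no") for obj in negatives]


def evaluate(questions: Sequence[Question],
             answer_fn: Callable[[str], str]) -> Dict[str, float]:
    """Poll the model with each question and score it as a binary classifier."""
    tp = fp = tn = fn = 0
    for text, label in questions:
        # Treat any answer that starts with "yes" as a positive response.
        said_yes = answer_fn(text).strip().lower().startswith("yes")
        if said_yes and label == "yes":
            tp += 1
        elif said_yes:
            fp += 1  # hallucination: the model affirms an absent object
        elif label == "no":
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


# Toy usage: a stand-in "model" that answers yes to everything.
questions = build_questions(
    present=["chair", "table"],
    vocabulary=["chair", "table", "dog", "car", "person"],
    rng=random.Random(0),
)
metrics = evaluate(questions, lambda question: "Yes, there is.")
print(metrics)
```

A model that always answers "yes" reaches perfect recall but only 50% accuracy on this balanced question set, and its yes-ratio of 1.0 exposes the bias directly. Because the score depends only on the first word of each answer, it is not affected by description length or by object-matching rules.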

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Multi-Object Hallucination in Vision-Language Models

A Survey on Hallucination in Large Vision-Language Models

Evaluation and Analysis of Hallucination in Large Vision-Language Models

A Survey of Hallucination in Large Visual Language Models

Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training

From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

Hallucination of Multimodal Large Language Models: A Survey

Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models

H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

Mitigating Multilingual Hallucination in Large Vision-Language Models

Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models