Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, Huaxiu Yao
2024-03-17
Abstract: Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.
Machine Learning, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses object hallucination in Large Vision-Language Models (LVLMs). Object hallucination is the inclusion, in a generated description, of objects that do not actually exist in the image. It can mislead users and harm applications that rely on these descriptions, such as robotics, medical imaging, and human-computer interaction.

### Background and Motivation

Large vision-language models have made significant progress in understanding real-world images and show potential for achieving general artificial intelligence, but they still suffer from object hallucination. Hallucination produces inaccurate descriptions that include non-existent objects or omit important features. This affects tasks such as visual summarization and reasoning, and it misleads users in downstream applications.

### Solution

To address this issue, the authors propose a simple yet powerful algorithm called the LVLM Hallucination Revisor (LURE), which corrects object hallucinations post hoc, after the LVLM has generated its description. LURE is based on a rigorous statistical analysis of the key factors underlying object hallucination:

1. **Co-occurrence**: Certain objects frequently appear together in images.
2. **Uncertainty**: Objects with higher uncertainty during LVLM decoding are more likely to be hallucinated.
3. **Object Position**: Hallucinations often appear in the later part of the generated text.

LURE can be seamlessly integrated with any LVLM and corrects object hallucinations by reconstructing descriptions with fewer hallucinations. The authors evaluated LURE on six open-source LVLMs, where it achieved a 23% improvement in general object hallucination evaluation metrics over the previous best approach and consistently ranked at the top in both GPT and human evaluations.
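
The three factors above can each be turned into a simple number. The sketch below is only an illustration and is not taken from the paper: the function names are hypothetical, and the choices of pair counts for co-occurrence, negative log-probability for uncertainty, and a normalized token index for position are assumptions made here for clarity.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(image_objects):
    """Count how often each unordered pair of objects appears in the same image.

    `image_objects` is a list of object-name lists, one list per image.
    """
    counts = Counter()
    for objs in image_objects:
        # Sort so that each pair is stored under one canonical key.
        for a, b in combinations(sorted(set(objs)), 2):
            counts[(a, b)] += 1
    return counts


def uncertainty(token_prob):
    """Negative log-probability of a decoded token (assumed proxy for uncertainty)."""
    return -math.log(token_prob)


def relative_position(index, length):
    """Token index scaled by description length; values near 1 are late in the text."""
    return index / length


# Toy data, invented for illustration only.
pair_counts = cooccurrence_counts([["dog", "ball"], ["dog", "ball", "tree"], ["cat"]])
```

In this toy data, "ball" and "dog" appear together in two images, so a model that has seen such data might be more inclined to mention a ball whenever it sees a dog. A token decoded with probability 0.1 gets a higher uncertainty value than one decoded with probability 0.9.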
### Method Overview

The core of LURE is a revisor that takes a potentially hallucinated description and converts it into an accurate one. The steps are as follows:

1. **Data Preparation**: Use GPT-3.5 to build hallucinated descriptions from the original descriptions by introducing potentially co-occurring objects and by replacing objects that are uncertain or located at the end of the description.
2. **Training the Hallucination Revisor**: Fine-tune an LVLM on the generated hallucinated dataset so that it learns to correct hallucinated descriptions.
3. **Inference Phase**: In the generated description, use the placeholder tag `[IDK]` to mark objects with high uncertainty or those located at the end of the description, then use the trained revisor to re-evaluate and correct these objects.

### Experimental Results

The authors validated LURE with automatic metrics (such as CHAIR, BLEU, and CLIP scores) as well as human and GPT evaluations. The results show that LURE significantly reduces object hallucinations across multiple LVLMs and outperforms the other baseline methods. In addition, a comparison with methods that only fine-tune on additional data shows that LURE's improvement is not solely due to the extra data but also to its rectification mechanism.

### Conclusion

The paper proposes LURE, an effective post-hoc method for correcting object hallucination in LVLMs. Backed by statistical analysis and experimental validation, LURE performs well across multiple evaluation metrics and offers a new way to improve the reliability and accuracy of LVLMs.
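
Returning to the inference phase in the method overview, the marking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `mask_objects`, both thresholds, the use of negative log-probability as the uncertainty measure, and the choice to replace a flagged object with `[IDK]` are all hypothetical. The trained revisor that would consume the masked text is not shown.

```python
import math

IDK = "[IDK]"


def mask_objects(tokens, probs, object_words, unc_threshold=1.0, pos_threshold=0.8):
    """Replace object tokens that are uncertain or late in the text with [IDK].

    tokens:        words of the generated description
    probs:         decoding probability of each token (same length as tokens)
    object_words:  set of lowercase words treated as object mentions
    Both thresholds are illustrative values, not taken from the paper.
    """
    n = len(tokens)
    masked = []
    for i, (tok, p) in enumerate(zip(tokens, probs)):
        is_object = tok.lower() in object_words
        # Guard against log(0) for tokens reported with zero probability.
        uncertain = -math.log(max(p, 1e-12)) > unc_threshold
        late = i / n >= pos_threshold
        masked.append(IDK if is_object and (uncertain or late) else tok)
    return masked


# Toy description and probabilities, invented for illustration only.
tokens = ["a", "dog", "chases", "a", "ball", "near", "a", "tree", "and", "bench"]
probs = [0.9, 0.95, 0.9, 0.9, 0.2, 0.9, 0.9, 0.9, 0.9, 0.9]
masked = mask_objects(tokens, probs, {"dog", "ball", "tree", "bench"})
# masked == ["a", "dog", "chases", "a", "[IDK]", "near", "a", "tree", "and", "[IDK]"]
```

Here "ball" is masked because its decoding probability is low, and "bench" is masked because it falls in the last fifth of the description. "dog" and "tree" are kept because they are both confident and early enough.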