A Survey of Hallucination in Large Visual Language Models

Wei Lan, Wenyi Chen, Qingfeng Chen, Shirui Pan, Huiyu Zhou, Yi Pan
2024-10-20
Abstract: Large Visual Language Models (LVLMs) enhance user interaction and enrich user experience by integrating the visual modality on the basis of Large Language Models (LLMs). They have demonstrated powerful information processing and generation capabilities. However, the existence of hallucinations has limited the potential and practical effectiveness of LVLMs in various fields. Although much work has been devoted to hallucination mitigation and correction, few reviews summarize this issue. In this survey, we first introduce the background of LVLMs and hallucinations. Then, the structure of LVLMs and the main causes of hallucination generation are introduced. Further, we summarize recent works on hallucination correction and mitigation. In addition, the available hallucination evaluation benchmarks for LVLMs are presented from judgmental and generative perspectives. Finally, we suggest some future research directions to enhance the dependability and utility of LVLMs.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses hallucination in large vision-language models (LVLMs). By adding a visual modality to large language models (LLMs), LVLMs enhance user interaction and experience and demonstrate powerful information processing and generation capabilities. However, hallucinations limit the potential and practical effectiveness of LVLMs in various fields. Despite numerous efforts to mitigate and correct hallucinations, the issue lacks a systematic review.

### Main Content

1. **Background Introduction**:
   - **Background of LVLMs and Hallucination**: Introduces the basic structure of LVLMs and the main causes of hallucination.
   - **Structure of LVLMs**: LVLMs can be divided into a perception module, a cross-modal module, and a response module. The perception module typically uses a Vision Transformer (ViT) to convert images into high-dimensional vectors, the cross-modal module bridges the modality gap between vision and language, and the response module generates the final response.
2. **Main Causes of Hallucination**:
   - **Modality Gap**: Significant differences in distribution, features, and semantics between modalities bias the response module when it interprets image inputs.
   - **Toxicity in Datasets**: Because LVLMs are trained on large amounts of data, misleading samples in the dataset can cause them to generate hallucinations.
   - **LLM Hallucination**: The LLM acts as the "brain" of an LVLM and is itself prone to hallucination. When its parametric knowledge is incorrect or conflicts with the visual information it receives, hallucinations also result.
3. **Methods for Correcting and Mitigating Hallucination**:
   - **Dataset De-hallucination**: Cleans hallucinations out of existing datasets through methods such as data rewriting, removing overconfidence, and breaking co-occurrence relationships.
   - **Modality Gap**: Strengthens the LVLM's understanding of visual information and reduces hallucination through methods such as visual fusion, perception enhancement, and contrastive learning.
   - **Output Correction**: Directly corrects generated hallucinations through methods such as post-generation correction, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO).
4. **Evaluation Benchmarks**:
   - **Judgmental Benchmarks**: Evaluate hallucination in LVLMs from a judgment perspective.
   - **Generative Benchmarks**: Evaluate hallucination in LVLMs from a generative perspective.
5. **Future Research Directions**:
   - Proposes future research directions to enhance the reliability and practicality of LVLMs.

### Conclusion

Building a trustworthy LVLM requires overcoming the obstacle of hallucination. The paper summarizes recent progress on hallucination in LVLMs and proposes future research directions to improve the reliability and practicality of LVLMs.
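
As an illustration of the two benchmark perspectives listed under Evaluation Benchmarks, the sketch below shows one plausible way each kind of score could be computed. It is not taken from the paper. It assumes a judgmental benchmark poses yes/no questions about whether an object is present, and that a generative benchmark compares the objects mentioned in a free-form response against annotated ground-truth objects. The names `score_judgmental`, `hallucinated_object_rate`, and `BinaryScores` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BinaryScores:
    accuracy: float
    precision: float
    recall: float
    f1: float
    yes_ratio: float  # share of "yes" answers; a high value can hint at a yes-bias


def score_judgmental(predictions, labels):
    """Score yes/no answers against ground-truth labels (assumed format: "yes"/"no")."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    pairs = [
        (p.strip().lower() == "yes", t.strip().lower() == "yes")
        for p, t in zip(predictions, labels)
    ]
    tp = sum(p and t for p, t in pairs)          # said yes, object present
    fp = sum(p and not t for p, t in pairs)      # said yes, object absent (hallucination)
    fn = sum(not p and t for p, t in pairs)      # said no, object present
    tn = sum(not p and not t for p, t in pairs)  # said no, object absent
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return BinaryScores(
        accuracy=(tp + tn) / len(pairs),
        precision=precision,
        recall=recall,
        f1=f1,
        yes_ratio=(tp + fp) / len(pairs),
    )


def hallucinated_object_rate(mentioned, ground_truth):
    """Fraction of distinct mentioned objects that are absent from the annotations."""
    mentioned_set = {m.strip().lower() for m in mentioned}
    if not mentioned_set:
        return 0.0
    truth = {g.strip().lower() for g in ground_truth}
    return len(mentioned_set - truth) / len(mentioned_set)


if __name__ == "__main__":
    # Judgmental perspective: four yes/no questions, one wrong "yes".
    print(score_judgmental(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"]))
    # Generative perspective: "car" is mentioned but not annotated.
    print(hallucinated_object_rate(["dog", "frisbee", "car"], ["dog", "frisbee", "grass"]))
```

The judgmental score only needs a closed-form answer, so it is cheap to compute but probes one object at a time. The generative score works on open-ended output, but it depends on a separate step (not shown here) that extracts object mentions from the response text.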