A Survey on Hallucination in Large Vision-Language Models

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, Wei Peng
2024-05-06
Abstract: The recent development of Large Vision-Language Models (LVLMs) has attracted growing attention within the AI landscape for its practical implementation potential. However, "hallucination", or more specifically, the misalignment between factual visual content and the corresponding textual generation, poses a significant challenge to utilizing LVLMs. In this comprehensive survey, we dissect LVLM-related hallucinations in an attempt to establish an overview and facilitate future mitigation. Our scrutiny starts with a clarification of the concept of hallucinations in LVLMs, presenting a variety of hallucination symptoms and highlighting the unique challenges inherent in LVLM hallucinations. Subsequently, we outline the benchmarks and methodologies tailored specifically for evaluating hallucinations unique to LVLMs. Additionally, we investigate the root causes of these hallucinations, encompassing insights from the training data and model components. We also critically review existing methods for mitigating hallucinations. The survey concludes with a discussion of open questions and future directions pertaining to hallucinations within LVLMs.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the "hallucination" problem in Large Vision-Language Models (LVLMs): the mismatch between the factual visual content of an image and the text the model generates about it. The authors survey hallucinations in LVLMs with the aim of giving an overview of the problem and facilitating future mitigation. They first clarify the concept of hallucination, noting that it can manifest as judgment errors or description errors, and illustrate the different hallucination symptoms with examples. The paper then outlines the benchmarks and evaluation methods tailored to LVLM hallucinations and examines the root causes of these hallucinations, including biases in the training data, limitations of visual encoders, and modality alignment issues. Finally, it reviews existing mitigation methods and discusses open questions and future research directions. Overall, the paper aims to improve understanding of hallucinations in LVLMs and to guide the development of more reliable and efficient models.