Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models

Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, Mike Zheng Shou
2024-01-01
Abstract: Recent advancements in large vision-language models (LVLMs) have demonstrated impressive capability in understanding visual information with human language. Despite these advances, LVLMs still face challenges with multimodal hallucination, such as generating text descriptions of objects that are not present in the visual information. However, the underlying causes of multimodal hallucinations remain poorly explored. In this paper, we propose a new perspective, suggesting that the inherent biases in LVLMs might be a key factor in hallucinations. Specifically, we systematically identify a semantic shift bias related to paragraph breaks ('\n\n'), where the content before and after '\n\n' in the training data frequently exhibits significant semantic changes. This pattern leads the model to infer that the content following '\n\n' should be obviously different from the preceding, less hallucinatory content, thereby increasing the probability of hallucinatory descriptions after the '\n\n'. We have validated this hypothesis on multiple publicly available LVLMs. In addition, we find that deliberately inserting '\n\n' into the generated description can induce more hallucinations. A simple method is proposed to effectively mitigate the hallucination of LVLMs by skipping the output of '\n'.
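The abstract does not say how the skipping is implemented. One plausible reading, given here only as a minimal sketch, is to mask the newline token at decoding time so that it can never be selected. The toy vocabulary, the logit values, and the helper names `suppress_tokens` and `greedy_pick` are all hypothetical and are not taken from the paper.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Return a copy of `logits` with every banned token id set to -inf."""
    masked = list(logits)
    for token_id in banned_ids:
        masked[token_id] = -math.inf
    return masked

def greedy_pick(logits, banned_ids=()):
    """Pick the highest-scoring token id after masking the banned ids."""
    masked = suppress_tokens(logits, banned_ids)
    return max(range(len(masked)), key=masked.__getitem__)

# Hypothetical four-token vocabulary; index 2 stands in for the newline token.
VOCAB = ["a", "dog", "\n", "."]
NEWLINE_ID = VOCAB.index("\n")

# Hypothetical logits for one decoding step where the newline scores highest.
step_logits = [0.1, 1.2, 3.0, 0.5]

unconstrained = greedy_pick(step_logits)             # selects the newline token
constrained = greedy_pick(step_logits, [NEWLINE_ID])  # newline masked out
```

With the mask applied, the decoder falls back to the next most likely token ("dog" in this toy example) instead of opening a new paragraph. In a real LVLM the same effect would be obtained by masking the tokenizer's actual newline token id(s) at every generation step.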