Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges

Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, Huaxiu Yao
2023-11-07
Abstract: While GPT-4V(ision) impressively models both visual and textual information simultaneously, its hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.
Machine Learning, Computation and Language, Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper systematically evaluates and analyzes the hallucination behavior of GPT-4V(ision) when it processes visual and textual information. Specifically, the authors address two main problems:

1. **Evaluation of hallucination behavior**:
   - GPT-4V(ision) is prone to hallucinatory outputs when processing visual and textual information, but this behavior had not been systematically assessed. To fill this gap, the authors introduce a new benchmark, Bias and Interference Challenges in Visual Language Models (Bingo), which evaluates and exposes two common types of hallucination in visual-language models: bias and interference.
2. **Causes and solutions of hallucination behavior**:
   - **Bias**: The model tends to hallucinate certain types of responses, possibly because of imbalance in its training data. For example, region bias shows up as GPT-4V(ision) interpreting Western images or images containing English text better than images from other regions or images containing text in other languages.
   - **Interference**: The judgment of GPT-4V(ision) can be disrupted by how the text prompt is phrased or how the input image is presented, which makes hallucination more likely. For example, the model is easily confused by leading questions or when interpreting multiple images together.

The authors also examine existing methods for mitigating hallucination, such as self-correction and chain-of-thought reasoning, and find that they have limited effectiveness for GPT-4V(ision). They therefore emphasize the need for new solutions.
### Main contributions

- **Introduction of the Bingo benchmark**: The benchmark includes 190 failure instances and 131 success instances, covering three types of bias (region bias, OCR bias, fact bias) and two types of interference (image-to-image interference, text-to-image interference).
- **Empirical analysis**: An empirical analysis of GPT-4V(ision) on the Bingo benchmark reveals the main causes of its hallucination behavior and compares it with other visual-language models, such as LLaVA and Bard.
- **Exploration of mitigation strategies**: The authors evaluate existing methods such as self-correction and chain-of-thought reasoning and find that they reduce hallucinations to some extent but do not resolve them.

Through these studies, the authors hope to provide new insights and tools for understanding and addressing hallucination in visual-language models.
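
To make the benchmark's structure concrete, the following is a minimal sketch of how results over the five categories might be tallied. It is not the authors' evaluation code: the `Instance` record, the category identifiers, and the `failure_rates` helper are hypothetical names chosen for illustration, and the toy records are not benchmark data. Only the counts 190 and 131 come from the summary above.

```python
from collections import Counter
from dataclasses import dataclass

# Category names mirror the summary above; the identifiers are assumptions.
BIAS_TYPES = ("region", "ocr", "fact")
INTERFERENCE_TYPES = ("image_to_image", "text_to_image")
ALL_CATEGORIES = BIAS_TYPES + INTERFERENCE_TYPES


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one benchmark item."""
    category: str       # one of ALL_CATEGORIES
    hallucinated: bool  # True if the response was judged a hallucination


def failure_rates(instances):
    """Return {category: fraction of its instances judged as hallucinations}."""
    totals, failures = Counter(), Counter()
    for inst in instances:
        if inst.category not in ALL_CATEGORIES:
            raise ValueError(f"unknown category: {inst.category}")
        totals[inst.category] += 1
        failures[inst.category] += int(inst.hallucinated)
    return {cat: failures[cat] / totals[cat] for cat in totals}


# Toy records for illustration only; these are NOT benchmark data.
toy = [
    Instance("region", True),
    Instance("region", False),
    Instance("ocr", True),
    Instance("text_to_image", True),
]
rates = failure_rates(toy)

# Share of failure instances in the benchmark, using the counts reported above.
overall_failure_share = 190 / (190 + 131)
```

With the reported counts, failure instances make up roughly 59% of the 321 instances in total. Per-category rates would require the actual annotations from the Bingo repository.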