Abstract: While GPT-4V(ision) impressively models both visual and textual information simultaneously, its hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.

What problem does this paper attempt to address?

This paper systematically evaluates and analyzes the hallucination behavior of GPT-4V(ision) when it processes visual and textual information together. Specifically, the authors address two main problems:

1. **Evaluation of hallucination behavior**: GPT-4V(ision) can produce hallucinated outputs when processing visual and textual information, yet this behavior has not been systematically assessed. To fill this gap, the authors introduce a new benchmark, Bias and Interference Challenges in Visual Language Models (Bingo), which evaluates and exposes two common types of hallucination in visual-language models: bias and interference.
2. **Causes and solutions of hallucination behavior**:
   - **Bias**: The model tends to hallucinate certain types of responses, possibly because of imbalance in its training data. For example, region bias shows up as GPT-4V(ision) interpreting Western images or images containing English text better than images from other regions or images containing text in other languages.
   - **Interference**: The judgment of GPT-4V(ision) can be disrupted by how the text prompt is worded or how the input image is presented, making hallucinations more likely. For example, the model is easily confused by leading questions or when it has to interpret multiple images together.

The authors also examine existing methods for mitigating hallucinations, such as self-correction and chain-of-thought reasoning, and find that they have limited effectiveness on the hallucination problems of GPT-4V(ision). They therefore emphasize the need for new solutions.
### Main contributions

- **Introduction of the Bingo benchmark**: The benchmark includes 190 failure instances and 131 success instances, covering three types of bias (region bias, OCR bias, fact bias) and two types of interference (image-to-image interference, text-to-image interference).
- **Empirical analysis**: An empirical analysis of GPT-4V(ision) on the Bingo benchmark reveals the main causes of its hallucination behavior and compares it with other visual-language models (such as LLaVA and Bard).
- **Exploration of mitigation strategies**: Existing methods such as self-correction and chain-of-thought reasoning are evaluated; they reduce hallucinations to some degree but remain limited.

Through these studies, the authors hope to provide new insights and tools for understanding and addressing the hallucination problem in visual-language models.
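
To make the benchmark's structure concrete, the following is a minimal sketch of how results over the five categories named above might be tallied. It is not the authors' evaluation code: the `Instance` record, the `failure_rates` helper, and the toy records are all hypothetical and invented for illustration. Only the category names and the 190/131 instance counts come from the summary.

```python
from collections import Counter
from dataclasses import dataclass

# Category names follow the summary above; the record layout is an assumption.
BIAS_TYPES = ("region", "ocr", "fact")
INTERFERENCE_TYPES = ("image_to_image", "text_to_image")


@dataclass(frozen=True)
class Instance:
    category: str       # one of BIAS_TYPES or INTERFERENCE_TYPES
    hallucinated: bool  # True if the model's answer was judged a hallucination


def failure_rates(instances):
    """Return {category: fraction of its instances judged hallucinated}."""
    totals = Counter(i.category for i in instances)
    failures = Counter(i.category for i in instances if i.hallucinated)
    return {cat: failures[cat] / totals[cat] for cat in totals}


# Toy records, invented for illustration only (not Bingo data).
toy = [
    Instance("region", True),
    Instance("region", False),
    Instance("ocr", True),
    Instance("text_to_image", True),
    Instance("text_to_image", True),
    Instance("image_to_image", False),
]
rates = failure_rates(toy)

# The only figures taken from the source: 190 failure and 131 success instances.
total_instances = 190 + 131
failure_share = 190 / total_instances

print(rates)
print(total_instances, round(failure_share, 3))
```

With the counts reported above, the benchmark holds 321 instances in total, of which roughly 59% are failure cases. The per-category rates printed by the sketch describe the toy records only and say nothing about the real per-category breakdown.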

Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
