Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung
2024-10-30
Abstract: Despite their outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) may generate hallucinated content that does not exist in the given image. Most existing LVLM hallucination benchmarks are limited to evaluating object-related hallucinations. However, potential hallucination of the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy this, we design a unified framework to measure object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to conduct hallucination evaluation on (object, relation, object) triplets extracted from LVLMs' responses, so that it can be easily generalized to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark that can be used to study both object and relation hallucination at the same time. We conduct comprehensive evaluations on Tri-HE and observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem on the path towards reliable LVLMs. Moreover, based on our findings, we design a simple yet effective training-free approach to mitigate hallucinations in LVLMs, with which we exceed all open-source counterparts on Tri-HE and achieve performance comparable to the powerful GPT-4V. Our dataset and code for reproducing our experiments are publicly available at https://github.com/wujunjie1998/Tri-HE.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
Large Vision-Language Models (LVLMs) perform well in vision-language reasoning but may generate content that does not exist in the given image, a failure known as hallucination. Existing LVLM hallucination benchmarks mainly focus on object-related hallucinations, while hallucinations about the relation between two objects (relation hallucinations) remain under-studied. To address this gap, the paper designs a unified framework that measures object and relation hallucinations in LVLMs simultaneously. Specifically, the framework evaluates hallucinations on (object, relation, object) triplets extracted from LVLMs' responses, so it can be easily extended to different vision-language tasks. Based on this framework, the authors introduce Tri-HE, a new triplet-level hallucination evaluation benchmark that covers both object and relation hallucinations. Using this framework and benchmark, the authors find that relation hallucination is more severe than object hallucination among existing LVLMs, highlighting a previously neglected obstacle to reliable LVLMs. Based on these findings, the authors also design a simple yet effective training-free method to reduce hallucinations in LVLMs, which outperforms all open-source models on Tri-HE and performs comparably to the powerful GPT-4V.
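To make the triplet-level idea concrete, the following is a minimal sketch of how object and relation hallucination rates could be computed once triplets have been extracted from a response. It is not the authors' implementation: the names `Triplet`, `classify_triplet`, and `hallucination_rates` are hypothetical, the ground truth is assumed to be a list of annotated triplets for the image, and exact string matching stands in for whatever judging procedure the benchmark actually uses.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """An (object, relation, object) triplet; field names are illustrative."""
    subject: str
    relation: str
    obj: str


def classify_triplet(t, gt_objects, gt_triplets):
    """Label one predicted triplet (assumed rule, exact string matching)."""
    # An endpoint that is absent from the image counts as object hallucination.
    if t.subject not in gt_objects or t.obj not in gt_objects:
        return "object_hallucination"
    # Both objects exist, but the stated relation is not annotated.
    if t not in gt_triplets:
        return "relation_hallucination"
    return "supported"


def hallucination_rates(predicted, gt_triplets):
    """Return the fraction of predicted triplets that falls under each label."""
    gt_set = set(gt_triplets)
    gt_objects = {name for t in gt_set for name in (t.subject, t.obj)}
    counts = {"supported": 0, "object_hallucination": 0, "relation_hallucination": 0}
    for t in predicted:
        counts[classify_triplet(t, gt_objects, gt_set)] += 1
    total = len(predicted)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy example with made-up annotations for a single image.
gt = [Triplet("man", "riding", "horse"), Triplet("horse", "on", "beach")]
pred = [
    Triplet("man", "riding", "horse"),     # supported
    Triplet("man", "feeding", "horse"),    # objects exist, relation does not
    Triplet("dog", "next to", "horse"),    # "dog" is not in the image
    Triplet("horse", "on", "beach"),       # supported
]
rates = hallucination_rates(pred, gt)
print(rates)
```

In this toy case half of the predicted triplets are supported, and one quarter each falls under object and relation hallucination. A practical evaluator would likely need softer matching (synonyms, paraphrased relations) than the exact comparison assumed here.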