Abstract: Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability. A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucinations. Furthermore, current evaluation methods struggle to effectively address the subtle semantic distinctions between model outputs and reference data, as well as the balance between hallucination and informativeness. To address these issues, we introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases. Moreover, we propose a large language model (LLM)-based two-stage evaluation framework that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. Experiments on 10 established LVLMs demonstrate that our evaluation metric is more comprehensive and better correlated with humans than existing work when evaluating on our challenging human-annotated benchmark dataset. Our work also highlights the critical balance between faithfulness and coverage of model outputs, and encourages future works to address hallucinations in LVLMs while keeping their outputs informative.

What problem does this paper attempt to address?

This paper addresses hallucination in large vision-language models (LVLMs) when they generate text describing images. Hallucination means the model produces descriptions that sound plausible but are factually wrong or invented, which undermines its reliability. Existing benchmarks and evaluation methods mostly target object hallucinations and overlook other types, such as hallucinated attributes and relations. Current methods also struggle with subtle semantic differences between model outputs and reference data, and with the balance between hallucination and informativeness.

To address these issues, the paper proposes VALOR-BENCH, a multi-dimensional benchmark that covers objects, attributes, and relations, with challenging images selected based on associative biases. It also proposes VALOR-EVAL, a two-stage evaluation framework based on large language models (LLMs) that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. Experiments on 10 established LVLMs show that this metric is more comprehensive and correlates better with human judgment than existing work.

### Main contributions:

1. **VALOR-BENCH**: A comprehensive, human-annotated dataset that covers objects, attributes, and relations, with challenging images selected based on associative biases.
2. **VALOR-EVAL**: A two-stage, LLM-based evaluation framework that generalizes the earlier CHAIR metric, accounts for the trade-off between faithfulness and informativeness, and can evaluate objects, attributes, and relations in an open-vocabulary setting.
3. **Comprehensive evaluation**: A multi-dimensional evaluation of 10 mainstream LVLMs that focuses on the balance between faithfulness and coverage. The study finds that even models like GPT-4V hallucinate: although GPT-4V covers more of the information in the image, its faithfulness score is relatively low.

Through these contributions, the paper encourages the community to balance faithfulness and coverage in LVLMs in order to improve the reliability and practicality of these models.
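To make the faithfulness/coverage trade-off concrete, here is a minimal Python sketch of how two such scores could be computed once features have been extracted from a model output and from the reference annotation. It is an illustration under stated assumptions, not the paper's implementation: the summary above describes LLM-based extraction and matching, whereas this sketch substitutes normalized exact string matching, and the names `Scores`, `normalize`, and `score_caption` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    faithfulness: float  # share of generated features supported by the reference
    coverage: float      # share of reference features mentioned by the model


def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(item.lower().split())


def score_caption(generated: list[str], reference: list[str]) -> Scores:
    """Score one output against one reference annotation.

    Assumption: features (objects, attributes, or relations) are already
    extracted as short strings. Matching is exact after normalization, a
    stand-in for the LLM-based semantic matching described in the summary.
    """
    gen = {normalize(x) for x in generated}
    ref = {normalize(x) for x in reference}
    matched = gen & ref
    # Convention chosen for this sketch: an empty side scores 1.0
    # (nothing hallucinated, or nothing to cover).
    faithfulness = len(matched) / len(gen) if gen else 1.0
    coverage = len(matched) / len(ref) if ref else 1.0
    return Scores(faithfulness, coverage)


if __name__ == "__main__":
    terse = score_caption(["dog"], ["Dog", "frisbee", "grass", "tree"])
    verbose = score_caption(
        ["dog", "frisbee", "grass", "bench", "car"],
        ["Dog", "frisbee", "grass", "tree"],
    )
    print(terse)
    print(verbose)
```

In this toy example, the terse output is fully faithful but covers only one of four reference features, while the verbose output covers three of four at the cost of two unsupported features. That is the same tension the summary notes for models that describe more of the image but score lower on faithfulness.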

VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

Evaluating Object Hallucination in Large Vision-Language Models

FaithScore: Fine-grained Evaluations of Hallucinations in Large Vision-Language Models

MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

Evaluation and Analysis of Hallucination in Large Vision-Language Models

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

A Survey of Hallucination in Large Visual Language Models

A Survey on Hallucination in Large Vision-Language Models

H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Hallucination of Multimodal Large Language Models: A Survey

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs