VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Haoyi Qiu, Wenbo Hu, Zi-Yi Dou, Nanyun Peng
2024-10-04
Abstract: Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability. A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucinations. Furthermore, current evaluation methods struggle to effectively address the subtle semantic distinctions between model outputs and reference data, as well as the balance between hallucination and informativeness. To address these issues, we introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases. Moreover, we propose a large language model (LLM)-based two-stage evaluation framework that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. Experiments on 10 established LVLMs demonstrate that our evaluation metric is more comprehensive and better correlated with humans than existing work when evaluating on our challenging human-annotated benchmark dataset. Our work also highlights the critical balance between faithfulness and coverage of model outputs, and encourages future works to address hallucinations in LVLMs while keeping their outputs informative.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses hallucination in large vision-language models (LVLMs) that generate text describing images. Hallucination means the model produces descriptions that sound plausible but are incorrect or fabricated, which undermines the model's reliability. Existing benchmarks and evaluation methods mostly target object hallucinations and overlook other types, such as hallucinated attributes and relations. Current evaluation methods also struggle with the subtle semantic differences between model outputs and reference data, and with the balance between hallucination and informativeness. To address these issues, the paper proposes VALOR-BENCH, a multi-dimensional benchmark that covers objects, attributes, and relations and selects challenging images based on associative biases. It also proposes VALOR-EVAL, a two-stage evaluation framework based on large language models (LLMs) that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. Experiments on 10 established LVLMs show that this metric is more comprehensive and correlates better with human judgment than existing work.

### Main contributions:

1. **VALOR-BENCH**: A human-annotated dataset that covers objects, attributes, and relations, with challenging images selected based on associative biases.
2. **VALOR-EVAL**: A two-stage, LLM-based evaluation framework that generalizes the earlier CHAIR metric, accounts for the trade-off between precision and informativeness, and handles objects, attributes, and relations in an open-vocabulary setting.
3. **Comprehensive evaluation**: A multi-dimensional evaluation of 10 mainstream LVLMs that focuses on the balance between faithfulness and coverage. The study finds that even models such as GPT-4V hallucinate: although GPT-4V covers more of the information in the image, its faithfulness score is relatively low.

Through these contributions, the paper encourages the community to balance faithfulness and coverage in LVLMs so that the models become more reliable and practical.
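To make the faithfulness/coverage trade-off concrete, here is a minimal Python sketch. It is not the paper's implementation: it assumes faithfulness can be read as the share of generated features supported by the reference annotation and coverage as the share of reference features the output mentions, and it substitutes exact string matching for the LLM-based matching that the framework uses. The names `Scores` and `score_features` are hypothetical, and the feature sets are made-up illustrations.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    faithfulness: float        # supported generated features / all generated features
    coverage: float            # mentioned reference features / all reference features
    hallucination_rate: float  # CHAIR-style rate: 1 - faithfulness


def score_features(generated: set[str], reference: set[str]) -> Scores:
    """Score one model output against one reference annotation.

    Assumption: exact string overlap stands in for semantic matching.
    Convention chosen here: an empty set yields a vacuous score of 1.0.
    """
    matched = generated & reference
    faithfulness = len(matched) / len(generated) if generated else 1.0
    coverage = len(matched) / len(reference) if reference else 1.0
    return Scores(faithfulness, coverage, 1.0 - faithfulness)


if __name__ == "__main__":
    reference = {"dog", "frisbee", "grass", "tree", "sky"}

    # A cautious output: everything it says is supported, but it says little.
    cautious = score_features({"dog", "grass"}, reference)

    # A verbose output: mentions more of the image, but adds an unsupported object.
    verbose = score_features({"dog", "frisbee", "grass", "cat"}, reference)

    print("cautious:", cautious)
    print("verbose: ", verbose)
```

In this toy example the cautious output is fully faithful but covers only two of the five reference features, while the verbose output covers three and pays for it with one hallucinated object. Reporting a hallucination rate alone would hide that difference, which is why both scores are tracked together.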