VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Kang-il Lee, Minbeom Kim, Seunghyun Yoon, Minsung Kim, Dongryeol Lee, Hyukhun Koh, Kyomin Jung

2024-07-11

Abstract: Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.

Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the language prior problem in large vision-language models (LVLMs). On multimodal tasks, these models tend to generate responses from textual patterns alone while ignoring image information. This can lead to unwanted biases or hallucinations, especially for images outside the training distribution. Despite the importance of the problem, methods for accurately measuring language priors in LVLMs remain poorly studied. Existing benchmarks based on counterfactual or out-of-distribution images can partially measure language priors, but they cannot disentangle language priors from other confounding factors. The paper therefore proposes a new benchmark, VLind-Bench, designed specifically to measure the language priors, or "blindness", of LVLMs. The benchmark includes tests on counterfactual images to assess language priors, along with a series of tests of more basic capabilities: commonsense knowledge, visual perception, and commonsense biases. For each instance, all of the basic tests must be passed before the language prior is evaluated, which reduces the influence of other factors and gives a more accurate measurement.
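
The abstract states that the basic tests are passed before language priors are evaluated for an instance. One way to read that gating is sketched below in Python. This is a minimal illustration under stated assumptions, not the authors' scoring code: the `InstanceResult` record, its field names, and the `gated_language_prior_score` function are hypothetical, and the actual metric used by VLind-Bench may be defined differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class InstanceResult:
    """Pass/fail outcomes for one benchmark instance (hypothetical field names)."""
    commonsense_knowledge: bool
    visual_perception: bool
    commonsense_bias: bool
    language_prior: bool  # outcome of the counterfactual-image test


def passes_basic_tests(result: InstanceResult) -> bool:
    """True only if every basic capability test was passed."""
    return (
        result.commonsense_knowledge
        and result.visual_perception
        and result.commonsense_bias
    )


def gated_language_prior_score(results: Iterable[InstanceResult]) -> Optional[float]:
    """Pass rate on the language-prior test, restricted to instances
    that passed all basic tests. Returns None if no instance qualifies."""
    eligible = [r for r in results if passes_basic_tests(r)]
    if not eligible:
        return None
    return sum(r.language_prior for r in eligible) / len(eligible)


# Made-up outcomes, for illustration only.
example_results = [
    InstanceResult(True, True, True, True),    # eligible, passes
    InstanceResult(True, True, True, False),   # eligible, fails
    InstanceResult(True, False, True, False),  # excluded: perception test failed
    InstanceResult(False, True, True, True),   # excluded: knowledge test failed
]
example_score = gated_language_prior_score(example_results)
```

Under this reading, a failure on the language-prior test is only counted when the model has already shown the knowledge, perception, and bias behavior the instance relies on, so the score is less likely to reflect those other weaknesses.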

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

Revisiting the Role of Language Priors in Vision-Language Models

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Unveiling the Tapestry of Consistency in Large Vision-Language Models

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

DevBench: A multimodal developmental benchmark for language learning

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations