Abstract: Vision-Language Models like GPT-4, LLaVA, and CogVLM have surged in popularity recently due to their impressive performance in several vision-language tasks. Current evaluation methods, however, overlook an essential component: uncertainty, which is crucial for a comprehensive assessment of VLMs. Addressing this oversight, we present a benchmark incorporating uncertainty quantification into evaluating VLMs. Our analysis spans 20+ VLMs, focusing on the multiple-choice Visual Question Answering (VQA) task. We examine models on 5 datasets that evaluate various vision-language capabilities. Using conformal prediction as an uncertainty estimation approach, we demonstrate that the models' uncertainty is not aligned with their accuracy. Specifically, we show that models with the highest accuracy may also have the highest uncertainty, which confirms the importance of measuring it for VLMs. Our empirical findings also reveal a correlation between model uncertainty and its language model part.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the neglect of uncertainty in the evaluation of Vision-Language Models (VLMs). Current evaluation methods focus on how well models perform on vision-language tasks but overlook uncertainty. Uncertainty is essential for a comprehensive evaluation of VLMs because it indicates how reliable a model's predictions are.

### Specific problems

1. **Uncertainty quantification**:
   - Current evaluation methods do not consider model uncertainty, which can lead to an incomplete understanding of model performance.
   - The authors propose a benchmark that incorporates uncertainty quantification into the evaluation of VLMs.
2. **Relationship between uncertainty and accuracy**:
   - The study finds that a model with high accuracy may also have high uncertainty, so higher accuracy does not necessarily imply lower uncertainty.
   - This finding emphasizes the importance of considering both accuracy and uncertainty when evaluating VLMs.
3. **Model calibration**:
   - Model calibration refers to whether a model's prediction confidence matches its actual accuracy.
   - The authors use metrics such as Expected Calibration Error (ECE) to evaluate model calibration.

### Solutions

1. **Benchmarking**:
   - The authors prepared five datasets that cover various vision-language capabilities, focusing on the multiple-choice Visual Question Answering (MCQA) task.
   - These datasets were used to evaluate more than 20 VLMs.
2. **Uncertainty estimation method**:
   - Conformal prediction is adopted for uncertainty estimation. It is a practical technique with theoretical guarantees.
   - Two scoring functions, Least Ambiguous set-valued Classifiers (LAC) and Adaptive Prediction Sets (APS), are used to compute prediction sets, and uncertainty is measured as the average size of these prediction sets.
3. **Comprehensive evaluation indicator**:
   - A new indicator, Uncertainty-aware Accuracy (UAcc), combines accuracy and uncertainty to evaluate model performance more comprehensively.

### Experimental results

- **Accuracy and uncertainty**:
  - The experimental results show that a model with high accuracy does not necessarily have low uncertainty, and vice versa.
  - For example, the Monkey-Chat model ranks second in accuracy but ninth in uncertainty, indicating that its uncertainty is relatively high.
- **Model ranking changes**:
  - Once uncertainty is taken into account, the rankings of some models under UAcc differ significantly from their rankings by accuracy alone.
  - For example, the MoE-LLaVA-Phi2-2.7B model ranks eighth in accuracy but fourth in UAcc because its uncertainty is relatively low.

### Conclusion

This paper provides a more comprehensive evaluation framework for VLMs by introducing uncertainty quantification. The results emphasize the necessity of considering uncertainty when evaluating VLMs, which helps to better understand and apply these models.
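The uncertainty estimation described in the summary (conformal prediction with LAC and APS scores, average prediction set size as the uncertainty measure, and ECE for calibration) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes split conformal prediction with a held-out calibration set, per-option softmax probabilities for each multiple-choice question, the non-randomized APS variant, and an error rate `alpha` chosen for illustration. All function names here are hypothetical.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def lac_scores(probs, labels):
    """LAC nonconformity score: one minus the probability of the true option."""
    return 1.0 - probs[np.arange(len(labels)), labels]

def lac_sets(probs, qhat):
    """Boolean mask (n, K): options whose LAC score is at most qhat."""
    return (1.0 - probs) <= qhat

def _sorted_cumulative(probs):
    """Option order by descending probability and the cumulative mass in that order."""
    order = np.argsort(-probs, axis=1)
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    return order, cum

def aps_scores(probs, labels):
    """APS score (non-randomized): mass of all options ranked at or above the true one."""
    order, cum = _sorted_cumulative(probs)
    rank = np.argmax(order == labels[:, None], axis=1)
    return cum[np.arange(len(labels)), rank]

def aps_sets(probs, qhat):
    """Boolean mask (n, K): options whose APS score is at most qhat."""
    order, cum = _sorted_cumulative(probs)
    keep_sorted = cum <= qhat
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=1)
    return mask

def average_set_size(sets):
    """Uncertainty measure: mean number of options per prediction set."""
    return sets.sum(axis=1).mean()

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted gap between accuracy and confidence per bin."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

if __name__ == "__main__":
    # Synthetic stand-in for model outputs: 4 answer options per question.
    rng = np.random.default_rng(0)
    cal_probs = rng.dirichlet(np.ones(4), size=500)
    cal_labels = rng.integers(0, 4, size=500)
    test_probs = rng.dirichlet(np.ones(4), size=500)
    for name, score_fn, set_fn in [("LAC", lac_scores, lac_sets),
                                   ("APS", aps_scores, aps_sets)]:
        qhat = conformal_quantile(score_fn(cal_probs, cal_labels), alpha=0.1)
        print(name, "average set size:", average_set_size(set_fn(test_probs, qhat)))
```

A larger average set size means the model needs more candidate answers to reach the target coverage, which is why it can serve as an uncertainty measure that is independent of top-1 accuracy.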

Uncertainty-Aware Evaluation for Vision-Language Models
