Uncertainty-Aware Evaluation for Vision-Language Models

Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, Eugene Ilyushin
2024-02-22
Abstract: Vision-Language Models like GPT-4, LLaVA, and CogVLM have surged in popularity recently due to their impressive performance in several vision-language tasks. Current evaluation methods, however, overlook an essential component: uncertainty, which is crucial for a comprehensive assessment of VLMs. Addressing this oversight, we present a benchmark incorporating uncertainty quantification into evaluating VLMs. Our analysis spans 20+ VLMs, focusing on the multiple-choice Visual Question Answering (VQA) task. We examine models on 5 datasets that evaluate various vision-language capabilities. Using conformal prediction as an uncertainty estimation approach, we demonstrate that the models' uncertainty is not aligned with their accuracy. Specifically, we show that models with the highest accuracy may also have the highest uncertainty, which confirms the importance of measuring it for VLMs. Our empirical findings also reveal a correlation between model uncertainty and its language model part.
Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the neglect of uncertainty in the evaluation of Vision-Language Models (VLMs). Current evaluation methods focus on how well models perform across vision-language tasks but overlook uncertainty. Uncertainty is essential for a comprehensive evaluation of VLMs because it indicates how reliable a model's predictions are.

### Specific problems

1. **Uncertainty quantification**:
   - Current evaluation methods do not consider model uncertainty, which can give an incomplete picture of model performance.
   - The authors propose a benchmark that incorporates uncertainty quantification into the evaluation of VLMs.
2. **Relationship between uncertainty and accuracy**:
   - The study finds that a model with high accuracy may also have high uncertainty, so higher accuracy does not always come with lower uncertainty.
   - This finding underlines the importance of considering both accuracy and uncertainty when evaluating VLMs.
3. **Model calibration**:
   - Model calibration refers to whether a model's prediction confidence matches its actual accuracy.
   - The authors use metrics such as Expected Calibration Error (ECE) to evaluate calibration.

### Solutions

1. **Benchmarking**:
   - The authors prepared five datasets covering various vision-language capabilities, focusing on the Multiple-Choice Visual Question Answering (MCQA) task.
   - These datasets were used to evaluate more than 20 VLMs.
2. **Uncertainty estimation method**:
   - Conformal prediction is adopted for uncertainty estimation; it is a practical technique with theoretical guarantees.
   - Two scoring functions, Least Ambiguous set-valued Classifiers (LAC) and Adaptive Prediction Sets (APS), are used to compute prediction sets, and uncertainty is measured as the average size of these prediction sets.
3. **Comprehensive evaluation metric**:
   - A new metric, Uncertainty-aware Accuracy (UAcc), combines accuracy and uncertainty to evaluate model performance more comprehensively.

### Experimental results

- **Accuracy and uncertainty**:
  - The experimental results show that a model with high accuracy does not necessarily have low uncertainty, and vice versa.
  - For example, the Monkey-Chat model ranks second in accuracy but ninth in uncertainty, indicating that its uncertainty is relatively high.
- **Model ranking changes**:
  - Once uncertainty is taken into account, the rankings of some models under UAcc differ significantly from their rankings by accuracy alone.
  - For example, the MoE-LLaVA-Phi2-2.7B model ranks eighth in accuracy but fourth in UAcc because its uncertainty is relatively low.

### Conclusion

This paper provides a more comprehensive evaluation framework for VLMs by introducing uncertainty quantification. The results emphasize the necessity of considering uncertainty when evaluating VLMs, which helps to better understand and apply these models.
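The conformal prediction step described under Solutions can be illustrated with a minimal sketch of split conformal prediction for multiple-choice answers. It assumes each model yields a probability vector over the answer options and that a held-out calibration split is available. The function names, the non-randomized form of APS, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lac_scores(probs, labels):
    # LAC conformal score: one minus the probability given to the true option.
    return 1.0 - probs[np.arange(len(labels)), labels]

def aps_scores(probs, labels):
    # APS conformal score (non-randomized variant, an assumption here):
    # cumulative probability of all options ranked at or above the true option.
    order = np.argsort(-probs, axis=1)
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    rank = np.argmax(order == labels[:, None], axis=1)
    return cumsum[np.arange(len(labels)), rank]

def conformal_threshold(cal_scores, alpha):
    # The ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return float("inf")  # too few calibration points: include every option
    return np.sort(cal_scores)[k - 1]

def lac_sets(probs, qhat):
    # Boolean mask of options whose LAC score is within the threshold.
    return (1.0 - probs) <= qhat

def aps_sets(probs, qhat):
    # Boolean mask of options whose cumulative (ranked) mass is within the threshold.
    order = np.argsort(-probs, axis=1)
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    mask = np.zeros(probs.shape, dtype=bool)
    np.put_along_axis(mask, order, cumsum <= qhat, axis=1)
    return mask

def average_set_size(sets):
    # Uncertainty measure: mean number of options kept per question.
    return float(sets.sum(axis=1).mean())

if __name__ == "__main__":
    # Synthetic stand-in for model outputs on a 4-option task (hypothetical data).
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(2000, 4)) * 2.0
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(4, p=p) for p in probs])
    cal_p, test_p = probs[:1000], probs[1000:]
    cal_y, test_y = labels[:1000], labels[1000:]
    qhat = conformal_threshold(lac_scores(cal_p, cal_y), alpha=0.1)
    sets = lac_sets(test_p, qhat)
    coverage = sets[np.arange(len(test_y)), test_y].mean()
    print(f"LAC coverage={coverage:.3f}, average set size={average_set_size(sets):.2f}")
```

A larger average set size means the model needs more candidate answers to cover the true one at the chosen error rate `alpha`, which is why set size serves as the uncertainty measure.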
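The summary names UAcc but does not give its formula. One form used in related uncertainty benchmarks divides accuracy by the average prediction-set size and rescales by the square root of the number of answer options. The sketch below assumes that form; treat it as an assumption rather than the paper's confirmed definition. The model figures in the example are made up purely to show how rankings can flip.

```python
import math

def uncertainty_aware_accuracy(accuracy, avg_set_size, num_options):
    # Assumed form: UAcc = Acc / SS * sqrt(|Y|), where SS is the average
    # prediction-set size and |Y| is the number of answer options.
    if avg_set_size <= 0:
        raise ValueError("avg_set_size must be positive")
    return accuracy / avg_set_size * math.sqrt(num_options)

# Hypothetical models on a 4-option task (illustrative numbers, not results).
model_a = uncertainty_aware_accuracy(accuracy=0.80, avg_set_size=3.0, num_options=4)
model_b = uncertainty_aware_accuracy(accuracy=0.75, avg_set_size=2.0, num_options=4)
print(f"model A UAcc={model_a:.3f}, model B UAcc={model_b:.3f}")
```

Under this assumed form, the less accurate but more certain model B scores higher, mirroring the kind of ranking change reported for MoE-LLaVA-Phi2-2.7B.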
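For the calibration point above, Expected Calibration Error can be sketched with the standard equal-width binning estimator. The bin count and the use of the top-option probability as confidence are common defaults assumed here, not details taken from the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    # Confidence is the probability of the predicted (top) option.
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between accuracy and mean confidence, weighted by bin share.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```

A value near zero means the model's stated confidence tracks how often it is right; it says nothing on its own about prediction-set size, which is why the two measures are reported separately.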