Abstract: Vision-Language Models (VLMs) have emerged as the dominant approach for zero-shot recognition, adept at handling diverse scenarios and significant distribution changes. However, their deployment in risk-sensitive areas requires a deeper understanding of their uncertainty estimation capabilities, a relatively uncharted area. In this study, we explore the calibration properties of VLMs across different architectures, datasets, and training strategies. In particular, we analyze the uncertainty estimation performance of VLMs when calibrated in one domain, label set or hierarchy level, and tested in a different one. Our findings reveal that while VLMs are not inherently calibrated for uncertainty, temperature scaling significantly and consistently improves calibration, even across shifts in distribution and changes in label set. Moreover, VLMs can be calibrated with a very small set of examples. Through detailed experimentation, we highlight the potential applications and importance of our insights, aiming for more reliable and effective use of VLMs in critical, real-world scenarios.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to improve the calibration of Vision-Language Models (VLMs) across different architectures, datasets, and training strategies, especially in risk-sensitive applications. Specifically, the researchers focus on the uncertainty estimation ability of VLMs, which is essential to the reliability and effectiveness of these models in practice.

### Core problems of the paper

1. **Calibration of uncertainty**:
   - VLMs perform well in zero-shot recognition tasks, but deployment in high-risk fields (such as medicine or autonomous driving) requires accurate estimates of their uncertainty.
   - The researchers found that VLMs are not inherently well calibrated, so it is necessary to explore how calibration methods (such as temperature scaling) can improve this.
2. **Robustness to cross-domain and label set changes**:
   - The study examines how VLMs calibrated on one domain, label set, or hierarchy level perform when tested on a different one.
   - The results show that temperature scaling significantly improves the calibration of VLMs, even under distribution shifts and label set changes.
3. **Effectiveness of few-shot calibration**:
   - The study found that VLMs can be calibrated effectively with a very small number of samples, which is convenient in practice, especially when labeled data is scarce.
4. **Impact of prompts**:
   - The study also explored how different text prompts affect the calibration of VLMs. The results show that simple prompts (such as "a photo of a <class>") are sufficient to achieve good uncertainty estimation.

### Formula representation

- **Temperature scaling formula**: Temperature scaling changes the sharpness of the output probabilities by dividing the logits by a scalar. The new predicted confidence is:

  \[ \hat{p} = \max_i \frac{\exp(g_i(x) / T)}{\sum_{j = 1}^n \exp(g_j(x) / T)} \]

  where \( g_i(x) \) is the logit for class \( i \), the sum runs over the \( n \) classes, and \( T \) is the temperature parameter. A higher \( T \) makes the probability distribution smoother, and a lower \( T \) makes it sharper.
- **Expected Calibration Error (ECE)**: ECE is used to evaluate the calibration of the model. It is computed as:

  \[ \text{ECE} = \sum_{m = 1}^M \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{avgConf}(B_m) \right| \]

  where \( B_m \) is the set of samples whose confidence falls in the \( m \)-th of \( M \) confidence intervals, \( n \) here is the total number of samples, and \(\text{acc}(B_m)\) and \(\text{avgConf}(B_m)\) are the accuracy and the average confidence within that interval, respectively.

### Conclusion

Through extensive experiments and analysis, this paper reveals the calibration potential of VLMs and their robustness under different conditions. The results show that calibration methods such as temperature scaling can significantly improve the uncertainty estimation of VLMs, making them more reliable and effective in practical applications.
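The temperature scaling and ECE formulas above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The function names, the equal-width binning with 10 bins, and the choice to fit \( T \) by a grid search that minimizes negative log-likelihood on a small labeled set are all assumptions made for illustration; the summary does not say how \( T \) is fitted.

```python
import numpy as np


def softmax(logits, T=1.0):
    """Row-wise softmax of logits divided by temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def confidence(logits, T=1.0):
    """Predicted confidence: the largest temperature-scaled class probability."""
    return softmax(logits, T).max(axis=1)


def expected_calibration_error(conf, correct, n_bins=10):
    """ECE with equal-width bins (lo, hi] over [0, 1] (binning scheme is an assumption)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(conf)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.sum() / n * gap
    return ece


def fit_temperature(logits, labels, grid=None):
    """Pick T by grid search on negative log-likelihood (illustrative fitting choice)."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 200)
    labels = np.asarray(labels, dtype=int)
    rows = np.arange(len(labels))
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        p = softmax(logits, T)
        nll = -np.log(p[rows, labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = float(T), nll
    return best_T
```

In use, one would fit `T` once on a small labeled calibration set with `fit_temperature`, then compare `expected_calibration_error` on `confidence(logits, 1.0)` and `confidence(logits, T)` for a separate test set, where `correct` marks whether the arg-max class matched the label.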

An Empirical Study Into What Matters for Calibrating Vision-Language Models
