An Empirical Study Into What Matters for Calibrating Vision-Language Models

Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Gould, Tom Gedeon
2024-06-14
Abstract: Vision-Language Models (VLMs) have emerged as the dominant approach for zero-shot recognition, adept at handling diverse scenarios and significant distribution changes. However, their deployment in risk-sensitive areas requires a deeper understanding of their uncertainty estimation capabilities, a relatively uncharted area. In this study, we explore the calibration properties of VLMs across different architectures, datasets, and training strategies. In particular, we analyze the uncertainty estimation performance of VLMs when calibrated in one domain, label set or hierarchy level, and tested in a different one. Our findings reveal that while VLMs are not inherently calibrated for uncertainty, temperature scaling significantly and consistently improves calibration, even across shifts in distribution and changes in label set. Moreover, VLMs can be calibrated with a very small set of examples. Through detailed experimentation, we highlight the potential applications and importance of our insights, aiming for more reliable and effective use of VLMs in critical, real-world scenarios.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to improve the calibration of Vision-Language Models (VLMs) across different architectures, datasets, and training strategies, especially in risk-sensitive application scenarios. Specifically, the researchers focus on the uncertainty estimation ability of VLMs, which is important for ensuring that these models are reliable and effective in practice.

### Core problems of the paper

1. **Calibration of uncertainty**:
   - VLMs perform well in zero-shot recognition tasks, but deployment in high-risk fields (such as medicine or autonomous driving) requires accurate estimates of their uncertainty.
   - The researchers found that VLMs are not inherently well calibrated, so it is necessary to explore how calibration methods (such as temperature scaling) can improve this.
2. **Robustness to cross-domain and label set changes**:
   - The study examines how VLMs calibrated on one domain, label set, or hierarchy level perform when tested on a different one.
   - The results show that temperature scaling significantly improves the calibration of VLMs, even under distribution shifts and label set changes.
3. **Effectiveness of few-shot calibration**:
   - The study found that VLMs can be calibrated effectively with a very small number of samples, which is convenient in practice, especially when labeled data is scarce.
4. **Impact of prompts**:
   - The study also explored how different text prompts affect the calibration of VLMs. The results show that simple prompts (such as "a photo of a <class>") are sufficient to achieve good uncertainty estimation.

### Formula representation

- **Temperature scaling formula**: Temperature scaling changes the sharpness of the output probabilities by dividing the logits by a temperature \( T \).
  The new predicted confidence is:

  \[ \hat{p} = \max_i \frac{\exp(g_i(x) / T)}{\sum_{j = 1}^n \exp(g_j(x) / T)} \]

  where \( g_i(x) \) is the logit for class \( i \), the sum runs over the \( n \) classes, and \( T \) is the temperature parameter. A higher \( T \) makes the probability distribution smoother, and a lower \( T \) makes it sharper.

- **Expected Calibration Error (ECE)**: ECE is used to evaluate the calibration performance of the model. It is computed as:

  \[ \text{ECE} = \sum_{m = 1}^M \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{avgConf}(B_m) \right| \]

  where \( M \) is the number of confidence bins, \( B_m \) is the set of samples whose confidence falls in the \( m \)-th bin, \( n \) here is the total number of samples (not the number of classes, as in the softmax above), and \(\text{acc}(B_m)\) and \(\text{avgConf}(B_m)\) are the accuracy and the average confidence within that bin, respectively.

### Conclusion

Through extensive experiments and analysis, this paper reveals the potential of VLMs for calibration and their robustness under different conditions. The results show that calibration methods such as temperature scaling can significantly improve the uncertainty estimation of VLMs, making them more reliable and effective in practical applications.
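The temperature scaling and ECE formulas can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the 15 equal-width bins, the grid search over \( T \) that minimizes negative log-likelihood on a held-out set, and the synthetic logits are all assumptions made for the example. In a real VLM setting, the logits would be the image-text similarity scores for each class prompt.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the class axis (rows are samples)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins (lo, hi]; n_bins is an assumed default."""
    conf = probs.max(axis=1)               # predicted confidence p-hat
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |B_m| / n * |acc(B_m) - avgConf(B_m)|
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def fit_temperature(logits, labels, grid=None):
    """Pick T by grid search on held-out NLL (an assumed, simple fitting procedure)."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 200)
    idx = np.arange(len(labels))
    def nll(T):
        p = softmax(logits, T)
        return -np.log(p[idx, labels] + 1e-12).mean()
    return float(min(grid, key=nll))

# Synthetic demo (hypothetical data): labels are drawn from softmax(z),
# but the "model" reports logits 5 * z, so it is overconfident.
rng = np.random.default_rng(0)
z = 2.0 * rng.normal(size=(5000, 10))
labels = (z + rng.gumbel(size=z.shape)).argmax(axis=1)  # Gumbel-max sampling from softmax(z)
logits = 5.0 * z

T_hat = fit_temperature(logits, labels)
ece_before = expected_calibration_error(softmax(logits), labels)
ece_after = expected_calibration_error(softmax(logits, T_hat), labels)
print(f"T = {T_hat:.2f}, ECE before = {ece_before:.3f}, ECE after = {ece_after:.3f}")
```

Because dividing by \( T \) does not change the argmax, accuracy is unaffected. Only the confidence values move, which is why a single scalar can be fitted from a small calibration set.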