Abstract: Cache-based approaches stand out as both effective and efficient for adapting vision-language models (VLMs). Nonetheless, the existing cache model overlooks three crucial aspects. 1) Pre-trained VLMs are mainly optimized for image-text similarity and neglect image-image similarity, leading to a gap between pre-training and adaptation. 2) The current cache model is based on the Nadaraya-Watson (N-W) estimator, which disregards the intricate relationships among training samples when constructing the weight function. 3) With limited samples, the logits generated by the cache model are highly uncertain; directly using these logits without accounting for their confidence could be problematic. This work presents three calibration modules aimed at addressing the above challenges. Similarity Calibration refines the image-image similarity by using unlabeled images. We add a learnable projection layer with a residual connection on top of the pre-trained image encoder of CLIP and optimize its parameters by minimizing a self-supervised contrastive loss. Weight Calibration introduces a precision matrix into the weight function to adequately model the relations between training samples, transforming the existing cache model into a Gaussian Process (GP) regressor, which could be more accurate than the N-W estimator. Confidence Calibration leverages the predictive variances computed by GP regression to dynamically re-scale the logits of the cache model, ensuring that the cache model's outputs are appropriately adjusted based on their confidence levels. In addition, to reduce the high complexity of GPs, we further propose a group-based learning strategy. Integrating the above designs, we propose both training-free and training-required variants. Extensive experiments on 11 few-shot classification datasets validate that the proposed methods can achieve state-of-the-art performance.
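The Similarity Calibration step described in the abstract (a learnable projection with a residual connection, trained with a self-supervised contrastive loss) can be illustrated with a minimal NumPy sketch. This is a forward-pass illustration only, not the authors' implementation: the function names `calibrate` and `info_nce`, the residual weight `gamma`, the `temperature`, and the choice of an InfoNCE-style loss are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def calibrate(features, W, gamma=0.5):
    """Hypothetical calibration layer: residual linear projection.

    phi(f) = normalize(f + gamma * f @ W). With W = 0 this reduces to the
    original (normalized) features, so calibration starts from the
    pre-trained similarity. `gamma` is an assumed hyperparameter.
    """
    return l2_normalize(features + gamma * features @ W)

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed choice of loss).

    Row i of z1 and row i of z2 are two views of the same unlabeled image;
    all other rows in the batch act as negatives.
    """
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def calibrated_similarity(fi, fj, W, gamma=0.5):
    """Image-image similarity after calibration: phi(f_i)^T phi(f_j)."""
    return calibrate(fi, W, gamma) @ calibrate(fj, W, gamma).T
```

In practice `W` would be optimized with an autodiff framework on two augmented views of each unlabeled image; the sketch only shows which quantities the loss compares.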

What problem does this paper attempt to address?

The paper targets three key deficiencies of existing cache models used to adapt contrastive vision-language models (VLMs) in few-shot tasks:

1. **Insufficient Image-Image Similarity**:
   - Existing pre-trained VLMs (such as CLIP) mainly optimize image-text similarity and ignore image-image similarity. This creates a gap between the pre-training stage and the cache model adaptation stage. Specifically, CLIP image embeddings are of limited use for computing cosine similarity between images, which hurts the performance of the cache model.
   - To solve this problem, the authors propose a **Similarity Calibration Module**, which refines image-image similarity using additional unlabeled images. They add calibration layers on top of the CLIP image encoder and optimize the parameters of these layers with a self-supervised contrastive loss.
2. **Limitations of the Weight Function**:
   - Existing cache models usually use a Gaussian kernel as the weight function, which amounts to a Nadaraya-Watson (N-W) estimator. It only considers the relationship between the query and each key, ignoring the relationships among the keys. As a result, the N-W estimator cannot effectively capture negative correlations.
   - To solve this problem, the authors introduce a **Weight Calibration Module**, which brings the precision matrix into the weight function to better model the relationships among the keys. With a suitable choice of the noise variance, the N-W estimator becomes a special case of the proposed method, so the GP-based cache model is more general and expressive.
3. **Ignoring Prediction Uncertainty**:
   - Existing cache models add their output directly to the logits of the zero-shot classifier without considering the inherent uncertainty of the cache model. When the test sample is close to the training samples, the output of the cache model should have lower uncertainty, and vice versa. The classical cache model (based on the N-W estimator) cannot quantify prediction uncertainty.
   - To solve this problem, the authors propose a **Confidence Calibration Module**, which uses the predictive variance provided by GP regression to dynamically rescale the logits of the cache model. The higher the predictive variance, the greater the uncertainty, and the more the cache model's contribution is scaled down, making its output more reliable.

In addition, to reduce the computational complexity of GPs, the authors introduce a group-based learning strategy: the categories are divided into groups and a GP model is applied to each group separately, which significantly reduces the computational cost. Building on these components, the authors propose training-free and training-required variants of a new cache model, GPCache, and verify their effectiveness on 11 few-shot classification datasets.

### Formula Summary

- **Similarity Calibration**:
  \[ s_c(f_i, f_j)=\phi(f_i; \theta^*)^T \phi(f_j; \theta^*) \]
  where $\phi(f; \theta)$ is the function computed by the calibration layers and $\theta^*$ denotes the optimized parameters.
- **Weight Calibration**:
  \[ \text{logits}'_{cc}=\kappa_c(f, F)(K_c+\sigma^2 I)^{-1}\cdot Y \]
  where $\kappa_c(f, F)$ is the calibrated weight function, $K_c$ is the covariance matrix, $\sigma^2$ is the noise variance, and $I$ is the identity matrix.
- **Confidence Calibration**:
  \[ \text{logits}_{\text{final}}=\text{logits}_{zs}+\alpha\cdot\frac{\text{logits}_{cc}}{\sigma(f)} \]
  where $\text{logits}_{zs}$ denotes the logits of the zero-shot classifier, $\alpha$ is the weight coefficient, and $\sigma(f)$ is the prediction variance (not to be confused with the noise variance $\sigma^2$ above).

Together, these three modules address the deficiencies of existing cache models listed above and improve few-shot vision-language model adaptation.
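The Weight Calibration and Confidence Calibration formulas can be made concrete with a small NumPy sketch. It is a hedged illustration under stated assumptions, not the paper's code: the kernel `exp(-beta * (1 - cosine))`, the unnormalized N-W-style baseline, the values of `beta`, `noise_var`, and `alpha`, and all function names are hypothetical. The sketch also takes $\sigma(f)$ to be the square root of the GP predictive variance, which is an assumption; the summary only states that the rescaling uses the predictive variance.

```python
import numpy as np

def gaussian_kernel(A, B, beta=5.0):
    """exp(-beta * (1 - cosine similarity)) between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.exp(-beta * (1.0 - A @ B.T))

def nw_cache_logits(f, F, Y, beta=5.0):
    """N-W-style cache (unnormalized): uses query-key similarity only."""
    return gaussian_kernel(f, F, beta) @ Y

def gp_cache(f, F, Y, beta=5.0, noise_var=0.1):
    """GP-regression cache: kappa(f, F) (K + sigma^2 I)^{-1} Y plus variance.

    f: (Q, D) query features, F: (N, D) cached keys, Y: (N, C) one-hot labels.
    """
    k_star = gaussian_kernel(f, F, beta)             # query-key weights, (Q, N)
    K = gaussian_kernel(F, F, beta)                  # key-key covariance, (N, N)
    A = K + noise_var * np.eye(F.shape[0])
    mean = k_star @ np.linalg.solve(A, Y)            # (Q, C)
    # Predictive variance k(f, f) - k* A^{-1} k*^T; k(f, f) = 1 for this kernel.
    reduction = np.einsum("qn,nq->q", k_star, np.linalg.solve(A, k_star.T))
    var = np.maximum(1.0 - reduction, 1e-12)         # clip for numerical safety
    return mean, var

def calibrated_logits(logits_zs, f, F, Y, alpha=1.0, beta=5.0, noise_var=0.1):
    """logits_final = logits_zs + alpha * logits_cc / sigma(f)."""
    mean, var = gp_cache(f, F, Y, beta, noise_var)
    sigma = np.sqrt(var)[:, None]                    # assumed: sigma = sqrt(variance)
    return logits_zs + alpha * mean / sigma
```

For a very large `noise_var`, `(K + noise_var * I)^{-1}` approaches `I / noise_var`, so `noise_var * mean` approaches the N-W-style logits; this is one way to see the "special case" relationship described above. A query that coincides with a cached key receives a smaller predictive variance than a query far from every key, so its cache logits are scaled up relative to the uncertain one.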

Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation
