Abstract: Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions. While optimizing prompts on downstream labeled data has proven effective in improving performance, these methods entail labor costs for annotations and are limited by their quality. Additionally, since CLIP is pre-trained on highly imbalanced Web-scale data, it suffers from inherent label bias that leads to suboptimal performance. To tackle the above challenges, we propose a label-Free prompt distribution learning and bias correction framework, dubbed **Frolic**, which boosts zero-shot performance without the need for labeled data. Specifically, Frolic learns distributions over prompt prototypes to capture diverse visual representations and adaptively fuses these with the original CLIP through confidence matching. This fused model is further enhanced by correcting label bias via a label-free logit adjustment. Notably, our method is not only training-free but also circumvents the necessity for hyper-parameter tuning. Extensive experimental results across 16 datasets demonstrate the efficacy of our approach, particularly outperforming the state-of-the-art by an average of $2.6\%$ on 10 datasets with CLIP ViT-B/16 and achieving an average margin of $1.5\%$ on ImageNet and its five distribution shifts with CLIP ViT-B/16. Code is available at <a class="link-external link-https" href="https://github.com/zhuhsingyuu/Frolic" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses two main challenges that zero-shot vision models face in practice:

1. **Label Dependence**: Existing methods improve performance by optimizing prompts on downstream labeled data. This requires human annotation, which is labor-intensive and limited by labeling quality, and it restricts how well these methods scale.
2. **Label Bias**: Because models such as CLIP are pre-trained on highly imbalanced web-scale data, they carry an inherent label bias. The predicted probabilities for certain classes come out too high or too low, which leads to suboptimal overall performance.

To address these problems, the authors propose Frolic, a framework that improves zero-shot performance through label-free prompt distribution learning and bias correction. Specifically, Frolic does the following:

- **Label-Free Prompt Distribution Learning**: Captures diverse visual representations by learning distributions over prompt prototypes, without labeled data.
- **Label-Free Bias Correction**: Estimates and corrects the label bias inherited from the pre-training data in an unsupervised way, yielding more balanced predictions.
- **No Hyperparameter Tuning**: Balances the contributions of the original CLIP model and the Gaussian-distribution-based model through adaptive fusion, which removes the need for a hyperparameter search.

According to the abstract, Frolic outperforms the state of the art by an average of 2.6% on 10 datasets and by an average margin of 1.5% on ImageNet and its five distribution shifts, both with CLIP ViT-B/16.

### Formula Summary

1. **Visual and Text Representation Computation**:
   \[ x_i=\Phi_v(x_i), \quad z_j = \Phi_t(z_j) \]
   where $\Phi_v$ and $\Phi_t$ denote the visual and text encoders, and the resulting representations share the same dimension, $x_i, z_j\in\mathbb{R}^d$.
2. **Zero-Shot Prediction**:
   \[ y=\arg\max_j f_c(x)_j=\arg\max_j z_j^{\top}x \]
   where $f_c(x)_j = z_j^{\top}x$ is the score of class $j$.
3. **Gaussian Discriminant Analysis Prediction**:
   \[ y=\arg\max_j f_g(x)_j=\arg\max_j \left( w_j^{\top}x + b_j \right) \]
   where $w_j=\hat{\Sigma}^{-1}z_j$ and $b_j =-\frac{1}{2}z_j^{\top}w_j$, with $\hat{\Sigma}$ an estimated covariance matrix.
4. **Prediction Fusion**:
   \[ f_f(x)=f_g(x)/\tau_g + f_c(x)/\tau_c \]
   where $\tau_g$ and $\tau_c$ scale the scores of the two models.
5. **Bias Correction**:
   \[ f_d(x)_y=f_f(x)_y-\ln\beta_y \]
   where $\beta_y$ is the estimated bias term for class $y$.

Together, these steps let Frolic improve zero-shot performance without relying on labeled data or hyperparameter tuning.
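
The scoring formulas in the summary can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, features are random stand-ins for CLIP embeddings, $\hat{\Sigma}$ is assumed to be the ridge-regularized covariance of the unlabeled features, and $\tau_g$, $\tau_c$, and $\beta$ are passed in as placeholder values because the summary does not describe how confidence matching or the label-free bias estimate are computed.

```python
import numpy as np


def l2_normalize(a, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return a / np.linalg.norm(a, axis=axis, keepdims=True)


def zero_shot_logits(x, z):
    """f_c(x)_j = z_j^T x for features x (n, d) and prototypes z (k, d)."""
    return x @ z.T


def gda_logits(x, z, sigma_hat):
    """f_g(x)_j = w_j^T x + b_j, with w_j = Sigma^-1 z_j and b_j = -1/2 z_j^T w_j."""
    # Solve Sigma w_j = z_j for all classes at once; row j of w is w_j.
    w = np.linalg.solve(sigma_hat, z.T).T
    b = -0.5 * np.sum(z * w, axis=1)
    return x @ w.T + b


def fuse(f_g, f_c, tau_g, tau_c):
    """f_f(x) = f_g(x) / tau_g + f_c(x) / tau_c."""
    return f_g / tau_g + f_c / tau_c


def debias(f_f, beta):
    """f_d(x)_y = f_f(x)_y - ln(beta_y), applied to every class y."""
    return f_f - np.log(beta)


# Toy data standing in for CLIP image features and text prototypes.
rng = np.random.default_rng(0)
n, k, d = 64, 3, 8
x = l2_normalize(rng.normal(size=(n, d)))
z = l2_normalize(rng.normal(size=(k, d)))

# Assumption: shared covariance estimated from unlabeled features, plus a
# small ridge term so the matrix is safely invertible.
sigma_hat = np.cov(x, rowvar=False) + 1e-3 * np.eye(d)

# Assumption: placeholder scaling factors and a uniform bias estimate.
tau_g, tau_c = 2.0, 1.0
beta = np.full(k, 1.0 / k)

f_c = zero_shot_logits(x, z)
f_g = gda_logits(x, z, sigma_hat)
f_d = debias(fuse(f_g, f_c, tau_g, tau_c), beta)
predictions = f_d.argmax(axis=1)  # one class index per input feature
```

With a uniform `beta`, the correction subtracts the same constant from every class and leaves the predictions unchanged; a non-uniform estimate lowers the scores of over-predicted classes relative to the rest.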

Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting

Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification

ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization

Robust Fine-Tuning of Vision-Language Models for Domain Generalization

Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

CRoF: CLIP-based Robust Few-shot Learning on Noisy Labels

SimCLIP: Refining Image-Text Alignment with Simple Prompts for Zero-/Few-shot Anomaly Detection

Improving Zero-Shot Generalization for CLIP with Synthesized Prompts

Fine-Grained Visual Prompting

SYNC-CLIP: Synthetic Data Make CLIP Generalize Better in Data-Limited Scenarios

Prompting Language-Informed Distribution for Compositional Zero-Shot Learning

Towards Alleviating Text-to-Image Retrieval Hallucination for CLIP in Zero-shot Learning

Learning to Decompose Visual Features with Latent Textual Prompts

In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

Lipsum-FT: Robust Fine-Tuning of Zero-Shot Models Using Random Text Guidance

Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model

Quantized Prompt for Efficient Generalization of Vision-Language Models

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

Prompt Distribution Learning