Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting

Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang
2024-10-25
Abstract: Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions. While optimizing prompts on downstream labeled data has proven effective in improving performance, these methods entail labor costs for annotations and are limited by their quality. Additionally, since CLIP is pre-trained on highly imbalanced Web-scale data, it suffers from inherent label bias that leads to suboptimal performance. To tackle the above challenges, we propose a label-Free prompt distribution learning and bias correction framework, dubbed **Frolic**, which boosts zero-shot performance without the need for labeled data. Specifically, our Frolic learns distributions over prompt prototypes to capture diverse visual representations and adaptively fuses these with the original CLIP through confidence matching. This fused model is further enhanced by correcting label bias via a label-free logit adjustment. Notably, our method is not only training-free but also circumvents the necessity for hyper-parameter tuning. Extensive experimental results across 16 datasets demonstrate the efficacy of our approach, particularly outperforming the state-of-the-art by an average of $2.6\%$ on 10 datasets with CLIP ViT-B/16 and achieving an average margin of $1.5\%$ on ImageNet and its five distribution shifts with CLIP ViT-B/16. Code is available at https://github.com/zhuhsingyuu/Frolic.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to solve two main challenges faced by zero-shot vision models in practical applications:

1. **Label Dependence**: Existing methods improve performance by optimizing prompts on downstream labeled data, but this requires human annotation, which is labor-intensive and bounded by annotation quality. It also restricts the scalability of these methods.
2. **Label Bias**: Because models such as CLIP are pre-trained on highly imbalanced web-scale data, they carry an inherent label bias. The predicted probabilities are too high for some classes and too low for others, which degrades overall performance.

To solve these problems, the authors propose a framework named Frolic, which enhances zero-shot performance through label-free prompt distribution learning and bias correction. Specifically, Frolic achieves the following goals:

- **Label-Free Prompt Distribution Learning**: Captures diverse visual representations by learning distributions over prompt prototypes, without the need for labeled data.
- **Label-Free Bias Correction**: Estimates and corrects the label bias inherited from the pre-training data without using labels, yielding more balanced predictions.
- **No Need for Hyperparameter Tuning**: Balances the contributions of the original CLIP model and the Gaussian-distribution-based model through adaptive fusion based on confidence matching, eliminating the need for hyperparameter search.

Experiments across 16 datasets show that Frolic improves zero-shot performance: with CLIP ViT-B/16 it outperforms the state of the art by an average of 2.6% on 10 datasets and by an average of 1.5% on ImageNet and its five distribution shifts.

### Formula Summary

1. **Visual and Text Representation Computation**: \[ x_i=\Phi_v(I_i), \quad z_j = \Phi_t(T_j) \] where \(I_i\) is the \(i\)-th image, \(T_j\) is the text prompt for class \(j\), \(\Phi_v\) and \(\Phi_t\) are the visual and text encoders, and the two representations have the same dimension \((x_i, z_j\in\mathbb{R}^d)\).
2. **Zero-Shot Prediction**: \[ y=\arg\max_j f_c(x)_j=\arg\max_j z_j^{\top}x \] where \(f_c(x)_j = z_j^{\top}x\) is the score of class \(j\).
3. **Gaussian Discriminant Analysis Prediction**: \[ y=\arg\max_j f_g(x)_j=\arg\max_j \left(w_j^{\top}x + b_j\right) \] where \(w_j=\hat{\Sigma}^{-1}z_j\), \(b_j =-\frac{1}{2}z_j^{\top}w_j\), and \(\hat{\Sigma}\) is the estimated covariance matrix shared by all classes.
4. **Prediction Fusion**: \[ f_f(x)=f_g(x)/\tau_g + f_c(x)/\tau_c \] where \(\tau_g\) and \(\tau_c\) are scaling factors set by the adaptive fusion rather than tuned by hand.
5. **Bias Correction**: \[ f_d(x)_y=f_f(x)_y-\ln\beta_y \] where \(\beta_y\) is the estimated label prior of class \(y\).

Through these methods, Frolic not only improves the performance of zero-shot models but also avoids the dependence on labeled data and hyperparameter tuning.
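
The five formulas above can be chained in a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: the function names (`clip_logits`, `gda_logits`, `fuse`, `debias`) are hypothetical, and the covariance `cov`, the scaling factors `tau_g` and `tau_c`, and the prior `beta` are treated as plain inputs, because this summary does not spell out how Frolic estimates them from unlabeled data.

```python
import numpy as np


def clip_logits(x, z):
    """Zero-shot scores f_c(x)_j = z_j^T x; x is (n, d), z is (K, d)."""
    return x @ z.T


def gda_logits(x, z, cov):
    """Gaussian discriminant scores f_g(x)_j = w_j^T x + b_j.

    Uses w_j = cov^{-1} z_j and b_j = -1/2 z_j^T w_j, with the text
    prototypes z_j acting as class means and one shared covariance.
    """
    # Solve cov @ w_j = z_j for all classes at once instead of inverting cov.
    w = np.linalg.solve(cov, z.T).T        # (K, d)
    b = -0.5 * np.sum(z * w, axis=1)       # (K,)
    return x @ w.T + b


def fuse(f_g, f_c, tau_g=1.0, tau_c=1.0):
    """Fused scores f_f = f_g / tau_g + f_c / tau_c.

    Assumption: the scaling factors are supplied by the caller; the
    confidence-matching step that sets them is not reproduced here.
    """
    return f_g / tau_g + f_c / tau_c


def debias(f_f, beta):
    """Logit adjustment f_d(x)_y = f_f(x)_y - ln(beta_y)."""
    return f_f - np.log(beta)


# Toy 2-D example (illustrative values, not real CLIP features).
x = np.array([[1.0, 0.0],      # clearly class 0
              [0.6, 0.5]])     # borderline, slightly closer to class 0
z = np.eye(2)                  # one prototype per class
cov = np.eye(2)                # assumed shared covariance
beta = np.array([0.6, 0.4])    # assumed prior: class 0 is over-represented

f_f = fuse(gda_logits(x, z, cov), clip_logits(x, z))
print("fused predictions:   ", f_f.argmax(axis=1))
print("debiased predictions:", debias(f_f, beta).argmax(axis=1))
```

In this toy case the fused model assigns both samples to class 0. After subtracting \(\ln\beta_y\), the confident first sample stays in class 0, while the borderline second sample moves to the under-represented class 1, which is the intended effect of the bias correction. With a uniform `beta`, the adjustment shifts all classes equally and the predictions do not change.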