Abstract: Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets, and significantly enhances the zero-shot performance of VLMs.

### What problem does this paper attempt to solve?

This paper addresses the gap between the zero-shot performance of pre-trained vision-language models (VLMs) on domain-specific datasets and the performance of supervised models. Although pre-trained VLMs such as CLIP perform well in zero-shot settings, on domain-specific downstream tasks they still fall short of supervised models trained for those tasks.

To bridge this gap, the authors propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting a small number of informative samples from the unlabeled data for labeling. The method has three components:

1. **Calibrating prediction entropy**: Large pre-trained models such as CLIP usually produce uncalibrated outputs, which leads to unbalanced predictions. The authors introduce a calibrated entropy to reduce the bias towards common categories.
2. **Neighbor uncertainty**: In addition to a sample's own uncertainty, the uncertainty of its neighbors is considered, so that a selected sample is uncertain itself and also lies in a region of high uncertainty.
3. **Uncertainty-weighted clustering**: Clustering ensures that the selected samples come from different regions of the feature space, which increases sample diversity.

The experimental results show that this method outperforms existing active learning methods on multiple image classification datasets and substantially improves the zero-shot performance of VLMs.

### Formula summary

1. **Entropy calculation formula**:
\[ H(x)=-\sum_{i = 1}^{K}P(y = i|x)\cdot\log(P(y = i|x)) \]
where \(P(y = i|x)\) is the probability calculated according to formula (1) of the paper and \(K\) is the number of classes.

2. **Calibrated probability formula**:
\[ \hat{P}(y = i|x)=\left(\frac{P(y = i|x)}{Q(i)}\right)/\left(\sum_{j = 1}^{K}\frac{P(y = j|x)}{Q(j)}\right) \]
where \(Q(i)\) is the context prior of the \(i\)-th class, defined as:
\[ Q(i)\approx\frac{1}{N}\sum_{x\in S_{i}}P(y = i|x) \]

3. **Neighbor uncertainty formula**:
\[ H_{NN}(x)=\frac{1}{k}\sum_{x_{i}\in kNN(x)}\exp(-\alpha\|z - z_{i}\|_{2}^{2})\cdot\hat{H}(x_{i}) \]
where \(z = \frac{f(x)}{\|f(x)\|}\) is the normalized visual representation of \(x\), \(z_{i}\) is that of the neighbor \(x_{i}\), and \(\hat{H}\) denotes the calibrated entropy, i.e. the entropy \(H\) evaluated with the calibrated probabilities \(\hat{P}\) in place of \(P\).

4. **Comprehensive uncertainty formula**:
\[ U(x)=\hat{H}(x)+H_{NN}(x) \]

With these components, the paper proposes an active learning framework that can significantly improve the performance of VLMs with limited labeled data.
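The four formulas above can be sketched in plain Python to make the data flow concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the prior \(Q(i)\) is approximated by the pool-wide mean of \(P(y = i|x)\) because the set \(S_{i}\) is not defined in this summary, a sample is excluded from its own neighbor set, and the toy probabilities and features are made up.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum_i p_i * log(p_i); zero-probability terms contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def calibrate(p, q):
    """Divide each class probability by its prior Q(i), then renormalize to sum to 1.
    Assumes every Q(i) > 0, which holds for softmax outputs."""
    w = [pi / qi for pi, qi in zip(p, q)]
    total = sum(w)
    return [wi / total for wi in w]

def l2_normalize(f):
    """z = f / ||f||, the normalized visual representation."""
    norm = math.sqrt(sum(v * v for v in f))
    return [v / norm for v in f]

def neighbor_uncertainty(feats, h_hat, k, alpha):
    """H_NN(x) = (1/k) * sum over the k nearest neighbors x_i of
    exp(-alpha * ||z - z_i||^2) * H_hat(x_i)."""
    z = [l2_normalize(f) for f in feats]
    out = []
    for a, za in enumerate(z):
        # Squared distances to all other samples (excluding self is an assumption),
        # sorted ascending and truncated to the k closest.
        nearest = sorted(
            (sum((u - v) ** 2 for u, v in zip(za, zb)), b)
            for b, zb in enumerate(z) if b != a
        )[:k]
        out.append(sum(math.exp(-alpha * d2) * h_hat[b] for d2, b in nearest) / k)
    return out

def combined_uncertainty(probs, feats, k=2, alpha=1.0):
    """U(x) = H_hat(x) + H_NN(x) for every sample in the pool."""
    n, num_classes = len(probs), len(probs[0])
    # Assumption: Q(i) is estimated as the mean of P(y = i | x) over the whole pool.
    q = [sum(p[i] for p in probs) / n for i in range(num_classes)]
    h_hat = [entropy(calibrate(p, q)) for p in probs]
    h_nn = neighbor_uncertainty(feats, h_hat, k, alpha)
    return [h + hn for h, hn in zip(h_hat, h_nn)]

# Toy pool: 4 samples, 3 classes, 2-D features (all values are illustrative).
probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.4, 0.35, 0.25], [0.5, 0.1, 0.4]]
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = combined_uncertainty(probs, feats, k=2, alpha=1.0)
query_index = max(range(len(scores)), key=scores.__getitem__)
```

Here `query_index` is the sample with the highest combined score. The summary describes the actual selection step as uncertainty-weighted clustering, which this sketch does not cover.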

Active Learning for Vision-Language Models

Active Prompt Learning in Vision Language Models

The Neglected Tails in Vision-Language Models

Revisiting Active Learning in the Era of Vision Foundation Models

Avoid Wasted Annotation Costs in Open-set Active Learning with Pre-trained Vision-Language Model

Active Prompt Learning with Vision-Language Model Priors

Post-hoc Probabilistic Vision-Language Models

Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models

Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

Towards Multimodal In-Context Learning for Vision & Language Models

Exploring Vision-Language Models for Imbalanced Learning

What Makes Good Few-shot Examples for Vision-Language Models?

Vision-Language Models for Zero-Shot Classification of Remote Sensing Images

LLM meets Vision-Language Models for Zero-Shot One-Class Classification

Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Label Propagation for Zero-shot Classification with Vision-Language Models

CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection