Active Learning for Vision-Language Models

Bardia Safaei, Vishal M. Patel
2024-10-30
Abstract: Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets, and significantly enhances the zero-shot performance of VLMs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the substantial gap between the zero-shot performance of pre-trained vision-language models (VLMs) on domain-specific datasets and the performance of supervised models. Although pre-trained VLMs such as CLIP perform well in zero-shot settings, on domain-specific downstream tasks they still trail supervised models trained for those tasks. To bridge this gap, the authors propose a novel active learning (AL) framework that improves the zero-shot classification performance of VLMs by selecting a small number of informative samples from the unlabeled data for labeling.

The method has three components:

1. **Calibrating prediction entropy**: Large pre-trained models such as CLIP usually produce uncalibrated outputs, which leads to unbalanced predictions. The authors introduce a calibrated entropy to reduce the bias towards common categories.
2. **Neighbor uncertainty**: In addition to a sample's own uncertainty, the uncertainty of its neighbors is considered, so that selected samples are uncertain themselves and also lie in regions of high uncertainty.
3. **Uncertainty-weighted clustering**: Clustering ensures that the selected samples come from different regions of the feature space, which increases their diversity.

The experimental results show that this method outperforms existing active learning methods on several image classification datasets and significantly improves the zero-shot performance of VLMs.

### Formula summary

1. **Entropy**:
\[ H(x) = -\sum_{i=1}^{K} P(y = i \mid x) \cdot \log P(y = i \mid x) \]
where \(P(y = i \mid x)\) is the class probability computed according to formula (1) of the paper.

2. **Calibrated probability**:
\[ \hat{P}(y = i \mid x) = \frac{P(y = i \mid x) / Q(i)}{\sum_{j=1}^{K} P(y = j \mid x) / Q(j)} \]
where \(Q(i)\) is the context prior of the \(i\)-th class, defined as:
\[ Q(i) \approx \frac{1}{N} \sum_{x \in S_{i}} P(y = i \mid x) \]

3. **Neighbor uncertainty**:
\[ H_{NN}(x) = \frac{1}{k} \sum_{x_{i} \in kNN(x)} \exp\left(-\alpha \|z - z_{i}\|_{2}^{2}\right) \cdot \hat{H}(x_{i}) \]
where \(z = \frac{f(x)}{\|f(x)\|}\) is the normalized visual representation and \(\hat{H}\) is the entropy computed from the calibrated probabilities \(\hat{P}\).

4. **Combined uncertainty**:
\[ U(x) = \hat{H}(x) + H_{NN}(x) \]

Together, these components form an active learning framework that improves the performance of VLMs with limited labeled data.
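The four formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It rests on several assumptions: the summary does not define \(S_i\) or \(N\), so the hypothetical helper `estimate_prior` approximates \(Q(i)\) by the mean predicted probability of class \(i\) over the whole unlabeled pool; a sample is not counted as its own neighbor; and all function names and the default values of `k` and `alpha` are illustrative.

```python
import numpy as np


def entropy(probs, eps=1e-12):
    """Row-wise Shannon entropy: H(x) = -sum_i P(y=i|x) log P(y=i|x)."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def estimate_prior(probs):
    """ASSUMPTION: approximate Q(i) by the mean predicted probability of
    class i over the whole pool (S_i and N are not defined in the summary)."""
    return probs.mean(axis=0)


def calibrate(probs, prior, eps=1e-12):
    """Calibrated probability: divide by Q(i), then renormalize each row."""
    ratio = probs / np.clip(prior, eps, None)
    return ratio / ratio.sum(axis=1, keepdims=True)


def neighbor_uncertainty(features, h_hat, k=5, alpha=1.0):
    """H_NN(x): distance-weighted mean of the calibrated entropy over the
    k nearest neighbors. ASSUMPTION: a sample is not its own neighbor."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    # For unit vectors, ||z - z_i||^2 = 2 - 2 * <z, z_i>.
    d2 = np.clip(2.0 - 2.0 * (z @ z.T), 0.0, None)
    np.fill_diagonal(d2, np.inf)  # exclude each sample from its own kNN set
    idx = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
    nn_d2 = np.take_along_axis(d2, idx, axis=1)
    weights = np.exp(-alpha * nn_d2)
    return (weights * h_hat[idx]).mean(axis=1)


def uncertainty_scores(probs, features, k=5, alpha=1.0):
    """Combined score: U(x) = H_hat(x) + H_NN(x)."""
    h_hat = entropy(calibrate(probs, estimate_prior(probs)))
    return h_hat + neighbor_uncertainty(features, h_hat, k=k, alpha=alpha)


# Usage on synthetic data (random stand-ins for VLM outputs and features).
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
features = rng.normal(size=(100, 32))
scores = uncertainty_scores(probs, features, k=5, alpha=1.0)
# Naive top-b pick for illustration only; it does NOT implement the
# uncertainty-weighted clustering step described above.
top = np.argsort(-scores)[:8]
```

The pairwise distance matrix costs O(n²) memory, which is fine for a sketch but would likely need an approximate nearest-neighbor index on a large unlabeled pool. The final top-b selection is only a stand-in: the summary describes a clustering step for diversity, which this sketch leaves out.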