BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models

Xuefeng Hu, Ke Zhang, Min Sun, Albert Chen, Cheng-Hao Kuo, Ram Nevatia
2024-06-18
Abstract: Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often encounter challenges in selecting appropriate learning rates to prevent collapsed training in the absence of validation data during test-time adaptation. In this study, we propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our approach directly estimates class centroids using online clustering within a projected embedding space that aligns text and visual embeddings. We dynamically aggregate predictions from both estimated and original class embeddings, as well as from distinct augmented views, by assessing the reliability of each prediction using Rényi Entropy. Through extensive experiments, we demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses test-time adaptation (TTA) of zero-shot vision-language models such as CLIP: how to improve accuracy at inference time while avoiding the training collapse that can occur when no validation data is available. Existing TTA methods struggle to select an appropriate learning rate, which can lead to instability or degraded performance. They also usually rely on backpropagation to update model weights, which increases computational cost and carries a risk of model collapse.

To this end, the authors propose BaFTA (Backprop-Free Test-Time Adaptation). Its main contributions are:

1. **Stable and efficient test-time adaptation without backpropagation**
   - BaFTA estimates class embeddings directly in the unified vision-text embedding space, with no backpropagation to update model weights. It exploits the natural clustering of high-quality visual embeddings and avoids the instability of self-supervised training at test time.
2. **Stable online clustering based on Rényi entropy aggregation**
   - Naive online clustering can suffer from biased assignments. BaFTA counters this with a Rényi entropy aggregation mechanism that dynamically combines the text-based and clustering-based predictions, averaging them with weights that reflect their reliability. This improves the accuracy and robustness of the final prediction.
3. **Extensive experimental verification**
   - Extensive experiments verify the effectiveness of BaFTA and of its individual components. BaFTA significantly improves the zero-shot classification accuracy of pretrained vision-language models at inference time while running significantly faster than prior test-time adaptation methods.
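The backprop-free centroid estimation in contribution 1 can be illustrated as a running mean kept on the unit sphere. The following is a minimal NumPy sketch, not the authors' implementation: the class name `OnlineCentroids`, the `init_count` parameter, and the assumption that embeddings are L2-normalized before the update are all hypothetical choices made for illustration. The pseudo-label `y` is taken as an input, so the sketch makes no claim about how it is chosen.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm (eps guards against division by zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


class OnlineCentroids:
    """Running class centroids, seeded from text embeddings (illustrative)."""

    def __init__(self, text_embeddings, init_count=0):
        # One centroid per class, initialized to the normalized text embedding.
        self.w = l2_normalize(np.asarray(text_embeddings, dtype=np.float64))
        # Samples folded into each centroid so far. init_count is an assumption:
        # with 0 the first assigned sample replaces the text seed, with a
        # positive value the text embedding stays part of the running mean.
        self.k = np.full(len(self.w), init_count, dtype=np.int64)

    def update(self, v, y):
        """Fold visual embedding v into the centroid of pseudo-label y."""
        v = l2_normalize(np.asarray(v, dtype=np.float64))
        self.w[y] = l2_normalize(self.k[y] * self.w[y] + v)
        self.k[y] += 1

    def similarities(self, v):
        """Cosine similarity between v and every centroid, shape (J,)."""
        return self.w @ l2_normalize(np.asarray(v, dtype=np.float64))
```

No gradient or learning rate appears anywhere in the update, which is the property the summary highlights: each step is a weighted average followed by renormalization.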
### Formula Summary

- **Class Embedding Estimation in Online Clustering**: each class centroid is initialized from the normalized text embedding \(t_j\),
  \[ w_j=\frac{t_j}{\|t_j\|} \]
  and updated when a visual embedding \(v_i\) is assigned to class \(y_i\):
  \[ w_{y_i}\leftarrow\frac{k_{y_i}w_{y_i}+v_i}{\|k_{y_i}w_{y_i}+v_i\|} \]
  \[ k_{y_i}\leftarrow k_{y_i}+1 \]
- **Projection Alignment**:
  \[ P^*(x):=\frac{U'U'^{\top}x}{\|U'U'^{\top}x\|} \]
  where \(U' = [e_2, e_3,\ldots, e_J]\) is the orthogonal basis after removing the principal component.
- **Rényi Entropy Calculation**:
  \[ Re(p)=\frac{1}{\alpha - 1}\log\left(\sum_{j = 1}^{J}(p[j])^{\alpha}\right) \]
- **Prediction Aggregation**:
  \[ \tilde{p}_i=\beta\frac{\sum_{b = 1}^{B}Re(p_i^b)\,p_i^b}{R_1}+\frac{\sum_{b = 1}^{B}Re(\hat{p}_i^b)\,\hat{p}_i^b}{R_2} \]
  where \(R_1=(1 + \beta)\sum_{b = 1}^{B}Re(p_i^b)\) and \(R_2=(1 + \beta)\sum_{b = 1}^{B}Re(\hat{p}_i^b)\).

Through these improvements, BaFTA significantly improves the performance and stability of the model while maintaining the zero-shot paradigm, and it performs especially well on large-scale datasets.
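The projection, Rényi entropy, and aggregation formulas above can be sketched in NumPy as follows. This is a hedged illustration that follows the formulas literally as summarized here, not the authors' code. Several details are assumptions: the basis \(U'\) is taken from an SVD of the text-embedding matrix, the defaults `alpha=0.5` and `beta=1.0` are placeholders rather than values from the paper, and the fallback to uniform weights when all scores are zero is added only to avoid division by zero. As written, \(Re(p)\) is the negative of the standard Rényi entropy and lies in \([-\log J, 0]\); the ratio weights are still non-negative because numerator and denominator share a sign.

```python
import numpy as np


def projection_basis(text_embeddings):
    """Return U' = [e_2, ..., e_J] as columns, shape (d, J - 1).

    Assumption: the e_j are right singular vectors of the (J, d)
    text-embedding matrix, ordered by decreasing singular value.
    """
    t = np.asarray(text_embeddings, dtype=np.float64)
    _, _, vt = np.linalg.svd(t, full_matrices=False)
    return vt[1:].T  # drop the first (principal) direction


def project(x, basis, eps=1e-12):
    """P*(x) = U'U'^T x / ||U'U'^T x|| for a single vector x of shape (d,)."""
    z = basis @ (basis.T @ x)
    return z / max(np.linalg.norm(z), eps)


def renyi_score(p, alpha=0.5):
    """Re(p) = log(sum_j p[j]^alpha) / (alpha - 1), over the last axis."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    p = np.asarray(p, dtype=np.float64)
    return np.log(np.sum(p ** alpha, axis=-1)) / (alpha - 1.0)


def weighted_mean(preds, alpha=0.5, eps=1e-12):
    """Average B predictions of shape (B, J) with weights Re(p^b) / sum_b Re(p^b)."""
    preds = np.asarray(preds, dtype=np.float64)
    scores = renyi_score(preds, alpha)  # shape (B,)
    total = scores.sum()
    if abs(total) < eps:
        # Assumption: fall back to uniform weights when every score is zero.
        weights = np.full(len(preds), 1.0 / len(preds))
    else:
        weights = scores / total
    return weights @ preds


def aggregate(p, p_hat, beta=1.0, alpha=0.5):
    """Combine the two prediction sources with mixing weights beta/(1+beta) and 1/(1+beta)."""
    return (beta * weighted_mean(p, alpha) + weighted_mean(p_hat, alpha)) / (1.0 + beta)
```

Here `p` and `p_hat` each hold the predictions of one source over the B augmented views. Because each inner average uses weights that sum to one and the two outer coefficients also sum to one, the result is again a probability vector whenever the inputs are.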