Abstract: Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often encounter challenges in selecting appropriate learning rates to prevent collapsed training in the absence of validation data during test-time adaptation. In this study, we propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our approach directly estimates class centroids using online clustering within a projected embedding space that aligns text and visual embeddings. We dynamically aggregate predictions from both estimated and original class embeddings, as well as from distinct augmented views, by assessing the reliability of each prediction using Rényi Entropy. Through extensive experiments, we demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.

What problem does this paper attempt to address?

This paper addresses how to avoid training collapse and improve performance during test-time adaptation (TTA) of zero-shot vision-language models such as CLIP when no validation data is available. Existing TTA methods struggle to select an appropriate learning rate, which can lead to instability or degraded performance. They also usually rely on back-propagation, which increases computational cost and carries a risk of model collapse.

To this end, the authors propose a new algorithm named BaFTA (Backprop-Free Test-Time Adaptation). Its main contributions are as follows:

1. **Stable and Efficient Test-Time Adaptation without Back-Propagation**:
   - BaFTA estimates class embeddings directly in the unified vision-text embedding space, with no back-propagation. This takes advantage of the natural clustering of high-quality visual embeddings and avoids the instability of self-supervised training at test time.
2. **Stable Online Clustering Based on Rényi Entropy Aggregation**:
   - To address the biased assignments that can occur in naive online clustering, BaFTA introduces a Rényi entropy aggregation mechanism. It dynamically combines the text-based and clustering-based predictions, weighting each by its reliability, which improves the accuracy and robustness of the final prediction.
3. **Extensive Experimental Verification**:
   - Through extensive experiments, the authors verify the effectiveness of BaFTA and of its individual components. BaFTA significantly improves the zero-shot classification accuracy of pre-trained vision-language models at inference time while also running significantly faster.
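As a rough illustration of the online clustering described in contributions 1 and 2, the following Python/NumPy sketch initialises one centroid per class from its text embedding and folds each incoming visual embedding into the centroid of its predicted class, following the running-mean update in the formula summary. The class name `OnlineCentroids`, the initial count of 1, and the softmax temperature are assumptions made for this example and are not given in the summary.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class OnlineCentroids:
    """Running class centroids initialised from text embeddings (hypothetical helper)."""

    def __init__(self, text_embeddings):
        # w_j = t_j / ||t_j||
        self.w = l2_normalize(np.asarray(text_embeddings, dtype=np.float64))
        # k_j counts the samples merged into centroid j.
        # Assumption: the text embedding counts as the first sample.
        self.k = np.ones(self.w.shape[0])

    def update(self, v, y):
        """Fold visual embedding v into the centroid of (pseudo-)label y."""
        v = l2_normalize(np.asarray(v, dtype=np.float64))
        # w_y = (k_y * w_y + v) / ||k_y * w_y + v||, then k_y = k_y + 1
        self.w[y] = l2_normalize(self.k[y] * self.w[y] + v)
        self.k[y] += 1

    def predict(self, v, temperature=0.01):
        """Softmax over cosine similarities to the current centroids.

        Assumption: temperature 0.01 is a placeholder value.
        """
        v = l2_normalize(np.asarray(v, dtype=np.float64))
        logits = self.w @ v / temperature
        logits = logits - logits.max()  # numerical stability
        p = np.exp(logits)
        return p / p.sum()
```

No gradients are computed anywhere: adapting to a test sample is a single normalised running-mean update, which is what makes the approach backprop-free.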
### Formula Summary

- **Class Embedding Estimation in Online Clustering**:
  \[ w_j=\frac{t_j}{\|t_j\|} \]
  \[ w_{y_i}=\frac{k_{y_i}w_{y_i}+v_i}{\|k_{y_i}w_{y_i}+v_i\|} \]
  \[ k_{y_i}=k_{y_i}+1 \]
- **Projection Alignment**:
  \[ P^*(x):=\frac{U'U'^{\top}x}{\|U'U'^{\top}x\|} \]
  where \(U' = [e_2, e_3,\ldots, e_J]\) is the orthogonal basis after removing the principal component.
- **Rényi Entropy Calculation**:
  \[ Re(p)=\frac{1}{\alpha - 1}\log\left(\sum_{j = 1}^{J}(p[j])^{\alpha}\right) \]
- **Prediction Aggregation**:
  \[ \tilde{p}_i=\beta\frac{\sum_{b = 1}^{B}Re(p_i^b)p_i^b}{R_1}+\frac{\sum_{b = 1}^{B}Re(\hat{p}_i^b)\hat{p}_i^b}{R_2} \]
  where \(R_1=(1 + \beta)\sum_{b = 1}^{B}Re(p_i^b)\) and \(R_2=(1 + \beta)\sum_{b = 1}^{B}Re(\hat{p}_i^b)\).

Through these improvements, BaFTA significantly improves the performance and stability of the model while maintaining the zero-shot paradigm, performing especially well on large-scale datasets.
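The projection, Rényi entropy, and aggregation formulas in the formula summary can be transcribed into a short NumPy sketch. It follows the formulas literally, so `renyi_weight` is non-positive and equals 0 for a one-hot prediction. Several details are assumptions not stated in the summary: the basis \(e_1,\ldots,e_J\) is taken from an SVD of the stacked text embeddings, `alpha=0.5` and `beta=1.0` are placeholder values, and the fallback to a plain mean when all weights vanish is added only for numerical safety.

```python
import numpy as np


def make_projection(text_embeddings):
    """Return P*(x) = U'U'^T x / ||U'U'^T x||.

    Assumption: e_1..e_J are the right singular vectors of the (J, D)
    text-embedding matrix, and U' drops the first (principal) one.
    """
    t = np.asarray(text_embeddings, dtype=np.float64)
    _, _, vt = np.linalg.svd(t, full_matrices=False)
    u_prime = vt[1:].T  # shape (D, K-1) with K = min(J, D)

    def project(x):
        y = u_prime @ (u_prime.T @ np.asarray(x, dtype=np.float64))
        return y / np.linalg.norm(y)

    return project


def renyi_weight(p, alpha=0.5):
    """Re(p) = log(sum_j p[j]^alpha) / (alpha - 1), as written in the summary."""
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]  # zero entries contribute nothing to the sum
    return np.log(np.sum(p ** alpha)) / (alpha - 1.0)


def _weighted_mean(preds, alpha):
    """sum_b Re(p^b) p^b / sum_b Re(p^b) over B augmented views."""
    preds = np.asarray(preds, dtype=np.float64)  # shape (B, J)
    w = np.array([renyi_weight(p, alpha) for p in preds])
    total = w.sum()
    if abs(total) < 1e-12:
        # Assumption: fall back to a plain mean when every weight is ~0.
        return preds.mean(axis=0)
    return (w @ preds) / total


def aggregate(preds, preds_hat, beta=1.0, alpha=0.5):
    """p~ = beta * sum Re(p)p / R1 + sum Re(p_hat)p_hat / R2,
    with R1 = (1 + beta) * sum Re(p) and R2 = (1 + beta) * sum Re(p_hat)."""
    m = _weighted_mean(preds, alpha)
    m_hat = _weighted_mean(preds_hat, alpha)
    return (beta * m + m_hat) / (1.0 + beta)
```

Because each weighted mean sums to 1 over classes, the aggregated vector also sums to 1, with \(\beta\) setting the relative share of the two prediction sources at \(\beta/(1+\beta)\) and \(1/(1+\beta)\).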

BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models
