Abstract: Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted with domain shifts. In response, we present Weight Average Test-Time Adaptation (WATT) of CLIP, a pioneering approach facilitating full test-time adaptation (TTA) of this VLM. Our method employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are utilized as pseudo labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy, enhancing overall test performance by aggregating diverse textual cues. Our findings underscore the efficacy of WATT in enhancing performance across diverse datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, VisDA-C, and several other challenging datasets, effectively covering a wide range of domain shifts. Notably, these enhancements are achieved without necessitating additional model transformations or trainable modules. Moreover, compared to other Test-Time Adaptation methods, our approach can operate effectively with just a single image. Highlighting the potential of innovative test-time strategies, this research emphasizes their role in fortifying the adaptability of VLMs. The implementation is available at: https://github.com/Mehrdad-Noori/WATT.git

What problem does this paper attempt to address?

The paper aims to address the performance degradation of Vision-Language Models (VLMs), particularly the CLIP model, when faced with data from different domains. Specifically, the generalization ability of CLIP suffers when it encounters new data that differs from the training data distribution. To tackle this challenge, the researchers proposed a new method called WATT (Weight Average Test-Time Adaptation). The main contributions of WATT are as follows:

1. **Proposed a novel Test-Time Adaptation (TTA) method**: This method is the first to utilize weight averaging techniques during the test phase to enhance the adaptability of the CLIP model. By combining multiple text prompt templates, diverse model hypotheses can be generated, and these can be integrated through weight averaging to improve the overall performance of the model.
2. **Achieved effective adaptation with a single image**: Compared to other test-time adaptation methods, WATT can operate effectively with only one image, which is a significant advantage.
3. **Extensive experimental validation**: The authors conducted numerous experimental evaluations, demonstrating the effectiveness of WATT on several challenging datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, and VisDA-C. These experiments covered various types of domain shifts, showcasing the robustness and effectiveness of WATT in different scenarios.

In summary, the goal of WATT is to improve the adaptability and performance of the CLIP model on unseen data by utilizing multiple text prompt templates and weight averaging strategies, without altering the model structure or introducing additional trainable modules.
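The two aggregation steps named in the summary, weight averaging across template-specific models and a text ensemble across templates, can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which parameters are updated or how textual cues are combined, so the sketch assumes a uniform element-wise mean in both cases. The names `average_weights` and `ensemble_text_embeddings`, and the toy parameter dictionaries, are hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters: name -> flat list of floats.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Uniform element-wise mean of parameters across models.

    Assumption: each model was adapted with a different text template
    and all models share the same parameter names and shapes.
    """
    if not models:
        raise ValueError("need at least one model")
    n = len(models)
    return {
        name: [
            sum(m[name][i] for m in models) / n
            for i in range(len(models[0][name]))
        ]
        for name in models[0]
    }


def ensemble_text_embeddings(per_template: List[List[float]]) -> List[float]:
    """Average one class's text embeddings over templates, then L2-normalize.

    Assumption: a plain mean is used to aggregate the textual cues.
    """
    dim = len(per_template[0])
    mean = [sum(e[i] for e in per_template) / len(per_template) for i in range(dim)]
    norm = sum(x * x for x in mean) ** 0.5
    return [x / norm for x in mean]


# Toy usage: two adapted "models" with a single two-element parameter.
adapted = [{"ln.weight": [1.0, 2.0]}, {"ln.weight": [3.0, 4.0]}]
merged = average_weights(adapted)  # {"ln.weight": [2.0, 3.0]}

# Toy usage: one class described by two templates in a 2-D embedding space.
class_embedding = ensemble_text_embeddings([[1.0, 0.0], [0.0, 1.0]])
```

In a real setting the parameter dictionaries would come from the adapted CLIP copies and the embeddings from CLIP's text encoder; the averaging logic itself would stay the same under the uniform-mean assumption.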

WATT: Weight Average Test-Time Adaptation of CLIP

CLIPArTT: Adaptation of CLIP to New Domains at Test Time

Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation

Test-time Alignment-Enhanced Adapter for Vision-Language Models

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-time Adaptation for Vision-Language Models

Effectiveness of Vision Language Models for Open-world Single Image Test Time Adaptation

VCL Challenges 2023 at ICCV 2023 Technical Report: Bi-level Adaptation Method for Test-time Adaptive Object Detection

AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping

Efficient Test-Time Adaptation of Vision-Language Models

CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation

TiC-CLIP: Continual Training of CLIP Models

Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

Distinguishing Textual Prompt Importance: Image-Guided Text Weighting for CLIP-Based Few-shot Learning

Video Test-Time Adaptation for Action Recognition

Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

DOTA: Distributional Test-Time Adaptation of Vision-Language Models