WATT: Weight Average Test-Time Adaptation of CLIP

David Osowiechi, Mehrdad Noori, Gustavo Adolfo Vargas Hakim, Moslem Yazdanpanah, Ali Bahri, Milad Cheraghalikhani, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers
2024-06-25
Abstract: Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted with domain shifts. In response, we present Weight Average Test-Time Adaptation (WATT) of CLIP, a pioneering approach facilitating full test-time adaptation (TTA) of this VLM. Our method employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are utilized as pseudo labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy, enhancing overall test performance by aggregating diverse textual cues. Our findings underscore the efficacy of WATT in enhancing performance across diverse datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, VisDA-C, and several other challenging datasets, effectively covering a wide range of domain shifts. Notably, these enhancements are achieved without necessitating additional model transformations or trainable modules. Moreover, compared to other Test-Time Adaptation methods, our approach can operate effectively with just a single image. Highlighting the potential of innovative test-time strategies, this research emphasizes their role in fortifying the adaptability of VLMs. The implementation is available at: https://github.com/Mehrdad-Noori/WATT.git
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the performance degradation of Vision-Language Models (VLMs), particularly CLIP, when they face data from different domains. Specifically, CLIP's generalization ability suffers when it encounters new data whose distribution differs from that of the training data. To tackle this challenge, the researchers propose a new method called WATT (Weight Average Test-Time Adaptation). The main contributions of WATT are as follows:

1. **Proposed a novel Test-Time Adaptation (TTA) method**: This method is the first to use weight averaging during the test phase to enhance the adaptability of the CLIP model. Combining multiple text prompt templates generates diverse model hypotheses, which are then integrated through weight averaging to improve the overall performance of the model.
2. **Achieved effective adaptation with a single image**: Unlike other test-time adaptation methods, WATT can operate effectively with only one image, which is a significant advantage.
3. **Extensive experimental validation**: The authors conducted numerous experimental evaluations, demonstrating the effectiveness of WATT on several challenging datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, and VisDA-C. These experiments cover various types of domain shift, showing the robustness of WATT in different scenarios.

In summary, the goal of WATT is to improve the adaptability and performance of the CLIP model on unseen data by using multiple text prompt templates and weight averaging, without altering the model structure or introducing additional trainable modules.
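
To make the weight-averaging step concrete, here is a minimal sketch in plain Python. It assumes that each text template yields its own adapted copy of the model parameters and that these copies are merged by an element-wise mean. The function name `average_weights`, the parameter names, and the toy values are hypothetical, chosen only for illustration; the authors' repository should be consulted for the actual procedure.

```python
from typing import Dict, List

# Assumption: a model is represented as parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Return the element-wise mean of several parameter sets.

    All inputs are assumed to share the same keys and value lengths.
    """
    if not models:
        raise ValueError("need at least one set of weights")
    averaged: Weights = {}
    for name in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


# Hypothetical toy example: three parameter sets, standing in for copies of
# the model adapted with three different text templates.
adapted = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 2.0], "layer.bias": [0.5]},
    {"layer.weight": [2.0, 5.0], "layer.bias": [1.0]},
]
merged = average_weights(adapted)
print(merged)  # {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```

The merged parameters form a single model, so inference cost stays the same as for one copy, which is consistent with the summary's point that no extra trainable modules are introduced.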