Frustratingly Easy Test-Time Adaptation of Vision-Language Models

Matteo Farina, Gianni Franchi, Giovanni Iacca, Massimiliano Mancini, Elisa Ricci
2024-05-29
Abstract: Vision-Language Models seamlessly discriminate among arbitrary semantic categories, yet they still suffer from poor generalization when presented with challenging examples. For this reason, Episodic Test-Time Adaptation (TTA) strategies have recently emerged as powerful techniques to adapt VLMs in the presence of a single unlabeled image. The recent literature on TTA is dominated by the paradigm of prompt tuning by Marginal Entropy Minimization, which, relying on online backpropagation, inevitably slows down inference while increasing memory. In this work, we theoretically investigate the properties of this approach and unveil that a surprisingly strong TTA method lies dormant and hidden within it. We term this approach ZERO (TTA with "zero" temperature), whose design is both incredibly effective and frustratingly simple: augment N times, predict, retain the most confident predictions, and marginalize after setting the Softmax temperature to zero. Remarkably, ZERO requires a single batched forward pass through the vision encoder only and no backward passes. We thoroughly evaluate our approach following the experimental protocol established in the literature and show that ZERO largely surpasses or compares favorably w.r.t. the state-of-the-art while being almost 10x faster and 13x more memory-friendly than standard Test-Time Prompt Tuning. Thanks to its simplicity and comparatively negligible computation, ZERO can serve as a strong baseline for future work in this field. The code is available at https://github.com/FarinaMatteo/zero.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the weak generalization of Vision-Language Models (VLMs) on challenging samples. Although VLMs can discriminate among arbitrary semantic categories, their accuracy still degrades on difficult inputs. Episodic Test-Time Adaptation (TTA) is a recently established family of strategies that adapts a VLM using a single unlabeled test image. The paper studies this setting and asks whether the expensive optimization that current TTA methods rely on is actually needed.

### Main Problem Description in the Paper

1. **Insufficient Generalization Ability**: When the test data differs substantially from the training data, the performance of VLMs drops significantly.
2. **Limitations of Existing Methods**: The dominant TTA paradigm is prompt tuning by Marginal Entropy Minimization (MEM). Although effective, it requires online backpropagation at test time, which slows down inference and increases memory consumption.

### Proposed Solution

The paper proposes ZERO, a TTA method that optimizes no parameters. Its core idea is to set the Softmax temperature to zero before marginalizing over augmented views. At zero temperature each view's prediction collapses to a one-hot vector on its top class, so the marginal distribution simply counts how many retained views vote for each class. The steps are:

- Augment the input image N times.
- Predict on every augmented view and retain only the most confident predictions.
- Set the Softmax temperature to zero and marginalize over the retained predictions.

ZERO requires a single batched forward pass through the vision encoder and no backward passes.

### Main Contributions

1. **Theoretical Analysis**: The paper shows that, under certain conditions, MEM has little effect on the marginal probability distribution; in particular, it does not change the class that the marginal distribution predicts.
2. **Lower Bound of Error Rate**: The paper shows that the error rate of the marginal probability distribution is a lower bound on the error rate of the base model, that is, the marginalized prediction is wrong no more often than standard inference.
3. **ZERO Method**: The paper introduces ZERO, which is simple and effective, and is almost 10 times faster and 13 times more memory-efficient than standard Test-Time Prompt Tuning.

### Experimental Verification

The paper evaluates ZERO following the experimental protocol established in the TTA literature. The results show that ZERO largely surpasses or compares favorably with state-of-the-art TTA methods, while running faster and using less memory.

### Summary

Combining theoretical analysis with experiments, the paper proposes ZERO, a simple and effective TTA method that improves the robustness of VLMs at test time at a small fraction of the compute and memory cost of prompt tuning. The authors position it as a strong baseline for future work in this field.
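
The ZERO steps (augment, predict, keep the confident views, marginalize at zero temperature) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the per-view class logits have already been computed by the model, it measures confidence by low Softmax entropy, and the function name `zero_predict`, the `keep_fraction` parameter, and the tie-breaking rule (lowest class index wins) are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable Softmax at temperature 1."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def zero_predict(view_logits, keep_fraction=0.1):
    """Sketch of zero-temperature marginalization over augmented views.

    view_logits: array of shape (N, C), one row of class logits per
        augmented view (assumed to come from a single batched forward pass).
    keep_fraction: hypothetical parameter, share of views to retain.
    Returns (predicted_class, marginal_distribution).
    """
    view_logits = np.asarray(view_logits, dtype=float)
    n_views, n_classes = view_logits.shape

    # Confidence filter (assumption: lower entropy means more confident).
    probs = softmax(view_logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    n_keep = max(1, int(round(keep_fraction * n_views)))
    kept = np.argsort(entropy)[:n_keep]

    # Zero temperature: each retained view becomes a one-hot vote for its argmax.
    votes = view_logits[kept].argmax(axis=1)

    # Marginalizing one-hot vectors is the same as normalizing vote counts.
    marginal = np.bincount(votes, minlength=n_classes) / n_keep

    # Ties are broken by the lowest class index (an illustrative choice).
    return int(marginal.argmax()), marginal


if __name__ == "__main__":
    # Four augmented views, three classes; two views are clearly confident.
    logits = np.array([
        [10.0, 0.0, 0.0],
        [8.0, 0.0, 0.0],
        [0.0, 1.0, 0.5],
        [0.2, 0.1, 0.3],
    ])
    pred, marginal = zero_predict(logits, keep_fraction=0.5)
    print(pred, marginal)
```

Because only an `argmax` and a vote count follow the forward pass, nothing in this procedure needs gradients, which is consistent with the summary's point that ZERO avoids backward passes.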