Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation

Sarthak Kumar Maharana, Baoming Zhang, Leonid Karlinsky, Rogerio Feris, Yunhui Guo
2024-12-04
Abstract: Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at increasing severity levels during test-time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose BAT-CLIP, a bimodal TTA method specially designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for better image feature extraction but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in TTA for CLIP, specifically for domains involving image corruption. Particularly, with a ViT-B/16 vision backbone, we obtain mean accuracy improvements of 9.7%, 5.94%, and 5.12% for CIFAR-10C, CIFAR-100C, and ImageNet-C, respectively.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the limited robustness of the CLIP (Contrastive Language-Image Pretraining) model to common image corruptions. Specifically:

1. **CLIP's zero-shot performance degrades under image corruption**:
   - Although CLIP performs well on clean images, its accuracy drops significantly when it encounters image corruptions of increasing severity at test time.
   - For example, on the CIFAR-100 dataset, CLIP with a ResNet-101 visual backbone drops sharply from 49% to 10.79% accuracy under Gaussian noise at severity 5.
2. **Limitations of existing test-time adaptation (TTA) methods**:
   - Most existing TTA methods are unimodal (they adapt only the visual encoder or only the text encoder), which limits how effectively they can adapt a multimodal model like CLIP.
   - For example, TPT (Test-time Prompt Tuning) only adjusts the text prompt, and VTE (Vision-Text Ensemble) relies on fixed encoders and cannot cope with severe image corruptions.
3. **Proposed solution**:
   - The paper proposes BAT-CLIP (Bimodal Test-Time Adaptation for CLIP), a bimodal TTA method that improves CLIP's robustness to common image corruptions by adapting the visual and text encoders simultaneously.
   - BAT-CLIP not only improves image feature extraction but also strengthens the alignment between image class prototypes and the corresponding text features, which improves robustness and classification accuracy.

### Main contributions

1. **In-depth analysis of CLIP's zero-shot performance**: The zero-shot classification performance of CLIP is evaluated in detail across different visual backbones and corruption severities, revealing how its accuracy degrades on corrupted images.
2. **Bimodal adaptation method**: BAT-CLIP maximizes the projection matching between class prototypes and text features and increases the cosine distance between class prototypes. This strengthens the alignment of visual and text features and makes the adaptation process more flexible and robust.
3. **Experimental verification**: Extensive experiments on multiple benchmark datasets (CIFAR-10C, CIFAR-100C, and ImageNet-C) show that BAT-CLIP outperforms existing TTA methods in online TTA, especially on image corruptions.

Through these improvements, BAT-CLIP can handle image corruptions more effectively in practical applications, especially in safety-critical fields such as autonomous driving, where the model must remain stable and reliable in complex environments.
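
The two alignment terms named in the bimodal adaptation method (projection matching between class prototypes and text features, and cosine distance between class prototypes) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `bimodal_alignment_terms`, the use of batch-level argmax pseudo-labels, and the plain averaging of each term are assumptions made for illustration, and the features are toy arrays standing in for CLIP encoder outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def bimodal_alignment_terms(image_feats, text_feats):
    """Hypothetical helper (not from the paper's code).

    image_feats: (N, D) image features for one unlabeled test batch.
    text_feats:  (C, D) text features, one per class prompt.
    Returns (pseudo_labels, projection_term, separation_term).
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)

    # Zero-shot prediction: cosine similarity to every class text feature.
    logits = img @ txt.T                      # (N, C)
    pseudo = logits.argmax(axis=1)            # pseudo-labels, shape (N,)

    # Class prototypes: mean image feature per pseudo-labeled class.
    present = np.unique(pseudo)               # classes that occur in the batch
    protos = np.stack([img[pseudo == c].mean(axis=0) for c in present])

    # Projection matching: project each prototype onto its unit text feature.
    projection = float(np.sum(protos * txt[present], axis=1).mean())

    # Separation: mean pairwise cosine distance between distinct prototypes.
    k = len(present)
    if k > 1:
        p = l2_normalize(protos)
        off_diag = ~np.eye(k, dtype=bool)
        separation = float((1.0 - (p @ p.T)[off_diag]).mean())
    else:
        separation = 0.0  # a single prototype has nothing to be separated from

    return pseudo, projection, separation


if __name__ == "__main__":
    # Toy features: 3 classes in a 3-dimensional embedding space.
    text = np.eye(3)
    images = np.array([[1.0, 0.0, 0.0],
                       [0.9, 0.1, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 2.0]])
    pseudo, proj, sep = bimodal_alignment_terms(images, text)
    print(pseudo.tolist())  # [0, 0, 1, 2]
    # Both terms are to be increased, so one assumed objective is to minimize:
    loss = -(proj + sep)
    print(loss)
```

In an actual test-time adaptation loop, both terms would be computed on differentiable encoder outputs and backpropagated into the visual and text encoders. How the terms are weighted, and which parameters are updated, is not stated in this summary, so the unweighted sum above is only a placeholder.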