Abstract: Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at increasing severity levels at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose BAT-CLIP, a bimodal TTA method specially designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for better image feature extraction but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in TTA for CLIP, specifically for domains involving image corruption. In particular, with a ViT-B/16 vision backbone, we obtain mean accuracy improvements of 9.7%, 5.94%, and 5.12% for CIFAR-10C, CIFAR-100C, and ImageNet-C, respectively.
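The abstract mentions image class prototypes computed from pseudo-labels. A minimal NumPy sketch of that idea follows, assuming the image and text features have already been extracted by CLIP's encoders. The function names (`zero_shot_pseudo_labels`, `class_prototypes`) and the choice of a plain per-class mean are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_pseudo_labels(image_feats, text_feats):
    """Assign each image the class whose text feature has the highest cosine similarity.

    image_feats: (N, D) array of image features for one unlabeled test batch.
    text_feats:  (C, D) array with one text feature per class prompt.
    """
    logits = l2_normalize(image_feats) @ l2_normalize(text_feats).T  # (N, C)
    return logits.argmax(axis=1)


def class_prototypes(image_feats, pseudo_labels, num_classes):
    """Average the image features that share a pseudo-label (assumed prototype definition).

    Returns the (C, D) prototype matrix and a boolean mask of classes that
    actually appear in the batch; absent classes keep an all-zero row.
    """
    protos = np.zeros((num_classes, image_feats.shape[1]))
    present = np.zeros(num_classes, dtype=bool)
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.any():
            protos[c] = image_feats[mask].mean(axis=0)
            present[c] = True
    return protos, present
```

Because the labels come from the model's own predictions, a prototype is only as reliable as the pseudo-labels behind it; classes missing from a batch are masked out here rather than guessed.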

What problem does this paper attempt to address?

This paper addresses the insufficient robustness of the CLIP (Contrastive Language-Image Pretraining) model to common image corruptions. Specifically:

1. **Zero-shot CLIP degrades under image corruption**:
   - Although CLIP performs well on clean images, its accuracy drops significantly when it encounters image corruptions of varying severity at test time.
   - For example, on the CIFAR-100 dataset, the accuracy of CLIP with a ResNet-101 visual backbone drops sharply from 49% to 10.79% under Gaussian noise at severity level 5.
2. **Existing test-time adaptation (TTA) methods are limited**:
   - Most existing TTA methods adapt a single modality (only the visual encoder or only the text encoder), which limits how effectively a multimodal model like CLIP can be adapted.
   - For example, TPT (Test-time Prompt Tuning) only adjusts the text prompt, and VTE (Vision-Text Ensemble) depends on a fixed encoder and cannot effectively handle severe image corruptions.
3. **Proposed solution**:
   - The paper proposes BAT-CLIP (Bimodal Test-Time Adaptation for CLIP), a bimodal TTA method that improves CLIP's robustness to common image corruptions by adapting the visual and text encoders simultaneously.
   - BAT-CLIP not only improves image feature extraction but also strengthens the alignment between image class prototypes and the corresponding text features, improving robustness and classification accuracy.

### Main contributions

1. **In-depth analysis of CLIP's zero-shot performance**:
   - The zero-shot classification performance of CLIP is evaluated in detail across different visual backbones and corruption severities, revealing how its performance degrades on corrupted images.
2. **Bimodal adaptation method**:
   - BAT-CLIP is proposed. It maximizes the projection matching between class prototypes and their text features and increases the cosine distance between class prototypes, which strengthens the alignment of visual and text features and makes the adaptation process more flexible and robust.
3. **Experimental verification**:
   - Extensive experiments on multiple benchmark datasets (CIFAR-10C, CIFAR-100C, and ImageNet-C) show that BAT-CLIP outperforms existing TTA methods in online TTA, especially on image corruptions.

With these improvements, BAT-CLIP handles image corruptions more effectively in practical applications, especially in safety-critical fields such as autonomous driving, where the model must remain stable and reliable in complex environments.
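The two alignment terms named above (projection matching between class prototypes and text features, and cosine distance between class prototypes) can be illustrated with a short NumPy sketch. This summary does not give the paper's exact loss formulation, so everything here is an assumption for illustration: the function names, the use of the scalar projection onto unit-norm text features, the simple averaging, and the unweighted sum in `bimodal_objective`.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def projection_matching_score(protos, text_feats, present):
    """Mean scalar projection of each present class prototype onto its unit text feature.

    protos:     (C, D) class prototypes built from pseudo-labeled image features.
    text_feats: (C, D) text features, one per class.
    present:    (C,) boolean mask of classes that appear in the batch.
    """
    proj = np.sum(protos * l2_normalize(text_feats), axis=1)  # (C,)
    return float(proj[present].mean())


def prototype_separation_score(protos, present):
    """Mean cosine distance (1 - cosine similarity) between distinct present prototypes."""
    p = l2_normalize(protos[present])
    k = p.shape[0]
    if k < 2:
        return 0.0  # nothing to separate with fewer than two prototypes
    cos = p @ p.T
    off_diag = cos[~np.eye(k, dtype=bool)]  # drop self-similarities
    return float((1.0 - off_diag).mean())


def bimodal_objective(protos, text_feats, present):
    """Loss to minimize: negative of (projection matching + prototype separation).

    The equal weighting of the two terms is an assumption of this sketch.
    """
    pm = projection_matching_score(protos, text_feats, present)
    sep = prototype_separation_score(protos, present)
    return -(pm + sep)
```

In an actual adaptation loop these quantities would be computed with a differentiable framework so that gradients reach both encoders; the NumPy version only shows what each term measures.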

Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation
