Abstract: Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we explore how to efficiently leverage class text information to mitigate these distribution drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem, which is efficiently solved with Optimal Transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple template knowledge distillation approach, which replicates multi-view contrastive learning strategies in unsupervised representation learning but without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of CLIP-OT, achieving performance gains of up to 7% over recent state-of-the-art methods, yet being computationally and memory efficient.

What problem does this paper attempt to address?

The paper addresses the significant performance drop that pre-trained vision-language models (such as CLIP) suffer when they encounter distribution shift at test time. Specifically, it studies how to use class text information efficiently to mitigate this shift during test-time inference. The proposed method, CLIP-OT, generates pseudo-labels for test samples by treating generic class text embeddings as fixed centroids of a label assignment problem, which is solved efficiently with Optimal Transport. CLIP-OT also integrates a multi-template knowledge distillation approach, which replicates multi-view contrastive learning strategies from unsupervised representation learning without adding computational complexity.

### Main Contributions

1. **Pseudo-label Strategy**: Pseudo-labeling for CLIP test-time adaptation is cast as an optimal transport task, using the class text information already available in the vision-language model as fixed cluster centers, with no further annotation.
2. **Optimal Transport Solution**: The label assignment task is solved with the Sinkhorn algorithm, which can handle multimodal distributions and computes the assignment efficiently.
3. **Multi-template Knowledge Distillation**: A multi-template knowledge distillation method uses the richer information derived from different text prompts to better guide adaptation, without adding significant computational or memory overhead.
4. **Experimental Verification**: Experiments across 244 scenarios show that CLIP-OT outperforms recent state-of-the-art methods while avoiding additional computational complexity.

### Method Overview

1. **Preliminary**: CLIP is a vision-language foundation model. Trained with contrastive learning, it learns visual representations from images and their associated text descriptions. At test time, the model can perform zero-shot prediction.
2. **Learning Objective**: To avoid the risk of simply minimizing entropy in the fully unlabeled setting, the paper first encodes the model prediction as a posterior distribution and then defines an optimization objective that maximizes the similarity between image features and class text prototypes.
3. **Optimization**: The objective is optimized with the Sinkhorn algorithm, whose entropy constraint gives the optimal regularized transport plan a simple structure. The problem is solved on each batch, and the matrix \(Q\) has dimension \(K\times B_s\), where \(B_s\) is the batch size.
4. **Knowledge Distillation of Multiple Text Prototypes**: Multiple class embeddings are used, each obtained from a different text template, and these diverse representations are distilled by optimizing the cross-entropy once per template.

### Experimental Results

1. **Evaluation under Natural or No Domain Shift**: On CIFAR-10, CIFAR-10.1 and CIFAR-100, CLIP-OT consistently outperforms existing methods.
2. **Performance under Common Corruptions**: On CIFAR-10C and CIFAR-100C, CLIP-OT improves markedly under common corruptions, by nearly 18% over the baseline CLIP model.

Overall, the paper tackles the distribution shift that CLIP encounters at test time by proposing CLIP-OT, which significantly improves the model's robustness and adaptability.
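The optimal transport step described in the method overview can be illustrated with a minimal NumPy sketch. It follows the standard Sinkhorn-Knopp normalization used in clustering-based self-supervised learning and is not taken from the paper's code. The function names, the values of `epsilon` and `n_iters`, the uniform class and sample marginals, and the reading of \(K\) as the number of classes are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit L2 norm so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def sinkhorn_pseudo_labels(image_feats, text_protos, epsilon=0.05, n_iters=3):
    """Return a (K, B) soft assignment matrix Q whose columns each sum to 1.

    image_feats: (B, D) image embeddings for one test batch.
    text_protos: (K, D) class text embeddings, kept fixed as centroids.
    epsilon:     entropic regularization strength (hypothetical default).
    n_iters:     number of Sinkhorn normalization rounds (hypothetical default).
    """
    z = l2_normalize(image_feats)
    w = l2_normalize(text_protos)
    sim = w @ z.T  # (K, B) cosine similarities between classes and images

    # Subtracting the global max only rescales Q by a constant and keeps
    # the exponentials numerically stable.
    q = np.exp((sim - sim.max()) / epsilon)
    q /= q.sum()

    num_classes, batch_size = q.shape
    for _ in range(n_iters):
        # Row step: every class receives total mass 1/K (assumed uniform marginal).
        q /= q.sum(axis=1, keepdims=True)
        q /= num_classes
        # Column step: every sample receives total mass 1/B.
        q /= q.sum(axis=0, keepdims=True)
        q /= batch_size

    # Rescale so each column is a probability distribution over classes.
    return q * batch_size
```

Each column of the returned matrix can be read as a soft pseudo-label for one test image. The uniform row marginal discourages the degenerate solution in which every sample is assigned to the same class, which is the usual motivation for this kind of constraint.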
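The multi-template distillation idea can be sketched in the same spirit. The summary only states that the cross-entropy is optimized once per text template, so the details here are assumptions: the pseudo-labels are treated as fixed targets, the per-template losses are averaged, and `temperature` is a hypothetical value. The function names are illustrative and do not come from the paper.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def multi_template_loss(image_feats, template_protos, pseudo_labels, temperature=0.01):
    """Average cross-entropy between pseudo-labels and per-template predictions.

    image_feats:     (B, D) L2-normalized image embeddings.
    template_protos: (T, K, D) L2-normalized class embeddings, one set per template.
    pseudo_labels:   (B, K) soft targets, each row summing to 1.
    temperature:     softmax temperature (hypothetical default).
    """
    # logits[t, b, k]: similarity of image b to class k under template t.
    logits = np.einsum("bd,tkd->tbk", image_feats, template_protos) / temperature
    log_probs = log_softmax(logits, axis=-1)                 # (T, B, K)
    # Cross-entropy of every template's prediction against the same targets.
    ce = -(pseudo_labels[None, :, :] * log_probs).sum(axis=-1)  # (T, B)
    # Averaging over templates and samples is an assumption of this sketch.
    return ce.mean()
```

Because the image features are encoded once and only the small set of text prototypes changes per template, the extra cost is a few additional matrix products, which is consistent with the claim that the multi-template step adds little overhead.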

Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation
