Abstract: Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities. But they still struggle with domain shifts and typically require labeled data to adapt to downstream tasks, which could be costly. In this work, we aim to leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models. Under this unsupervised multi-domain setting, we have identified inherent model bias within CLIP, notably in its visual and text encoders. Specifically, we observe that CLIP's visual encoder tends to prioritize encoding domain over discriminative category information, while its text encoder exhibits a preference for domain-relevant classes. To mitigate this model bias, we propose a training-free and label-free feature calibration method, Unsupervised Multi-domain Feature Calibration (UMFC). UMFC estimates image-level biases from domain-specific features and text-level biases from the direction of domain transition. These biases are subsequently subtracted from the original image and text features separately, to render them domain-invariant. We evaluate our method on multiple settings including transductive learning and test-time adaptation. Extensive experiments show that our method outperforms CLIP and performs on par with state-of-the-art methods that need additional annotations or optimization. Our code is available at https://github.com/GIT-LJc/UMFC.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the distribution shift problem of pre-trained vision-language models (such as CLIP) across different domains. Although these models demonstrate strong zero-shot transfer capabilities, they still struggle on downstream tasks in new domains and typically require labeled data for adaptation, which is costly in practice. The paper therefore proposes a method that enhances the transferability of vision-language models using unlabeled multi-domain data.

### Specific Problem Description

1. **Domain Shift Problem**: The performance of pre-trained models drops when downstream tasks follow a distribution different from the pre-training data.
2. **Labeled Data Requirement**: Existing adaptation methods usually require labeled data, which is costly to obtain in practical applications.
3. **Model Bias**: The paper identifies inherent biases in CLIP's visual and text encoders, specifically:
   - **Visual Encoder Bias**: CLIP's visual encoder tends to prioritize encoding domain information over category information, leading to significant differences in classification accuracy across domains.
   - **Text Encoder Bias**: CLIP's text encoder shows a preference for domain-relevant categories, such as misclassifying many samples as "squiggle" or "line" in the "quickdraw" domain.

### Solution

To mitigate these biases, the paper proposes the Unsupervised Multi-domain Feature Calibration (UMFC) method. UMFC calibrates CLIP's features through the following two modules:

1. **Image Feature Calibration Module (IFC)**:
   - Separates image features from different domains using a clustering algorithm.
   - Calculates the average image feature for each domain and subtracts this domain-specific bias from the original features to make them domain-invariant.
2. **Text Feature Calibration Module (TFC)**:
   - Estimates the domain transition direction of text features using image features from different domains.
   - Removes the text encoder's preference for domain-relevant category names by subtracting the domain transition vector.

### Experimental Validation

The paper conducts experiments on multiple datasets, including DomainNet and ImageNet variants. The results show that UMFC outperforms CLIP in scenarios such as unsupervised calibration, transductive learning, and test-time adaptation, and performs on par with state-of-the-art methods that need additional annotations or optimization, without requiring labeled data or parameter fine-tuning.

### Main Contributions

1. **Unsupervised Multi-domain Feature Calibration**: Proposes an unsupervised method that uses unlabeled multi-domain data to calibrate the features of vision-language models, enhancing their generalization ability across different domains.
2. **Reducing Model Bias**: Reduces biases in CLIP's visual and text encoders through image and text feature calibration.
3. **Efficiency**: UMFC is a training-free method with high computational efficiency, suitable for large-scale datasets.

### Conclusion

This paper proposes Unsupervised Multi-domain Feature Calibration (UMFC), a method that enhances the generalization ability of pre-trained vision-language models across different domains without using labeled data. It offers a way to exploit large-scale unlabeled data in practical applications.
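
The two calibration modules described in the Solution section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see their repository for that). It assumes image and text features have already been extracted by CLIP, uses a small hand-written k-means as the clustering step, and takes the "domain transition direction" to be the cluster mean minus the global mean of the image features. That last choice and all function names (`umfc_calibrate`, `predict`, `kmeans`) are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def kmeans(features, num_clusters, num_iters=20, seed=0):
    """Tiny k-means used as a stand-in for the clustering step."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), num_clusters, replace=False)
    centers = features[idx].copy()
    labels = np.zeros(len(features), dtype=int)
    for _ in range(num_iters):
        # Squared Euclidean distance from every feature to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = features[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels


def umfc_calibrate(image_feats, text_feats, num_domains, seed=0):
    """Subtract estimated domain biases from image and text features.

    image_feats: (N, D) unlabeled image features.
    text_feats:  (C, D) class-name text features.
    Returns calibrated image features (N, D), one calibrated text
    classifier per cluster (num_domains, C, D), and cluster labels (N,).
    """
    labels = kmeans(image_feats, num_domains, seed=seed)
    global_mean = image_feats.mean(axis=0)
    cal_img = np.empty_like(image_feats)
    cal_txt = np.empty((num_domains,) + text_feats.shape, dtype=text_feats.dtype)
    for k in range(num_domains):
        mask = labels == k
        # Fall back to the global mean if a cluster ends up empty.
        domain_mean = image_feats[mask].mean(axis=0) if mask.any() else global_mean
        # Image calibration: remove the per-domain mean feature.
        cal_img[mask] = image_feats[mask] - domain_mean
        # Text calibration (assumed form): remove the shift from the
        # global mean toward this domain's mean.
        cal_txt[k] = text_feats - (domain_mean - global_mean)
    return cal_img, cal_txt, labels


def predict(cal_img, cal_txt, labels):
    """Zero-shot prediction by cosine similarity within each cluster."""
    img = l2_normalize(cal_img)
    preds = np.empty(len(img), dtype=int)
    for k in range(cal_txt.shape[0]):
        mask = labels == k
        txt = l2_normalize(cal_txt[k])
        preds[mask] = (img[mask] @ txt.T).argmax(axis=1)
    return preds
```

Because each step is only a mean and a subtraction over precomputed features, the sketch involves no gradient updates, which is consistent with the training-free claim in the summary.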

UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model

CLIP2UDA: Making Frozen CLIP Reward Unsupervised Domain Adaptation in 3D Semantic Segmentation

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Adaptive Prompt Learning with Negative Textual Semantics and Uncertainty Modeling for Universal Multi-Source Domain Adaptation

Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models

CLIP the Divergence: Language-guided Unsupervised Domain Adaptation

Lightweight Unsupervised Federated Learning with Pretrained Vision Language Model

Exploiting multi-level consistency learning for source-free domain adaptation

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

Domain-Adaptive Semantic Segmentation Emerges From Vision-Language Supervised Domain-Debiased Self-Training

Unsupervised Domain Adaption Harnessing Vision-Language Pre-training

Open-Vocabulary Calibration for Fine-tuned CLIP

Domain Alignment with Large Vision-language Models for Cross-domain Remote Sensing Image Retrieval

Adversarial Domain Adaptation with CLIP for Few-Shot Image Classification

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

Towards Realistic Unsupervised Fine-tuning with CLIP