AD-CLIP: Adapting Domains in Prompt Space Using CLIP

Mainak Singha, Harsh Pal, Ankit Jha, Biplab Banerjee
2024-09-16
Abstract: Although deep learning models have shown impressive performance on supervised learning tasks, they often struggle to generalize well when the training (source) and test (target) domains differ. Unsupervised domain adaptation (DA) has emerged as a popular solution to this problem. However, current DA techniques rely on visual backbones, which may lack semantic richness. Despite the potential of large-scale vision-language foundation models like CLIP, their effectiveness for DA has yet to be fully explored. To address this gap, we introduce AD-CLIP, a domain-agnostic prompt learning strategy for CLIP that aims to solve the DA problem in the prompt space. We leverage the frozen vision backbone of CLIP to extract both image style (domain) and content information, which we apply to learn prompt tokens. Our prompts are designed to be domain-invariant and class-generalizable, by conditioning prompt learning on image style and content features simultaneously. We use standard supervised contrastive learning in the source domain, while proposing an entropy minimization strategy to align domains in the embedding space given the target domain data. We also consider a scenario where only target domain samples are available during testing, without any source domain data, and propose a cross-domain style mapping network to hallucinate domain-agnostic tokens. Our extensive experiments on three benchmark DA datasets demonstrate the effectiveness of AD-CLIP compared to existing literature. Code is available at https://github.com/mainaksingha01/AD-CLIP
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the **domain adaptation (DA) problem**: deep learning models generalize poorly when the training (source) domain and the testing (target) domain differ. The authors propose **AD-CLIP**, which performs domain adaptation in the prompt space by leveraging the semantic richness of the large-scale vision-language foundation model CLIP.

#### Main problem background

1. **Domain shift problem**: When training and testing data come from different distributions, model performance can drop substantially. Standard training assumes that both are drawn from the same distribution, and in practical applications this assumption often does not hold.
2. **Limitations of existing methods**:
   - Current unsupervised domain adaptation (UDA) techniques rely on visual backbones (for example, convolutional neural networks), which may lack semantic richness.
   - Large-scale vision-language models such as CLIP offer semantically rich features, but their effectiveness for domain adaptation has not been fully explored.

#### AD-CLIP's solutions

1. **Prompt learning strategy**: AD-CLIP learns prompt tokens that are domain-invariant and generalize across classes. The tokens are conditioned on style and content features extracted with CLIP's frozen vision backbone. Two types of tokens are introduced:
   - **Domain token**: captures the style information of images, which characterizes the domain.
   - **Image-specific tokens**: capture the content information of images, which carries their semantic meaning.
2. **Domain alignment strategy**: AD-CLIP uses standard supervised contrastive learning on the labeled source domain and proposes an entropy minimization strategy on target-domain data to align the two domains in the embedding space.
3. **Cross-domain style mapping network**: For the setting where only target-domain samples are available at test time, with no source-domain data, AD-CLIP proposes a cross-domain style mapping network that generates ("hallucinates") domain-agnostic tokens.
4. **Experimental verification**: Experiments on three benchmark datasets (Office-Home, VisDA-2017, and Mini-DomainNet) show that AD-CLIP compares favorably with existing methods.

### Summary

The core objective of AD-CLIP is to solve the domain adaptation problem in the prompt space: it uses CLIP's pre-trained vision and text encoders to learn prompts that are domain-agnostic and generalize across classes. The approach improves generalization across domains and points to prompt-space adaptation of vision-language models as a direction for future DA research.
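
The entropy minimization idea mentioned under the domain alignment strategy can be illustrated with a small sketch. This is a generic formulation, not the authors' exact objective (the linked repository holds the actual loss). It assumes the model produces one similarity logit per class for each unlabeled target image; the function names and the example logits below are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    probs = softmax(logits)
    # Skip zero probabilities, which contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def target_entropy_loss(batch_logits):
    """Mean prediction entropy over a batch of unlabeled target samples.

    Minimizing this value pushes the model toward confident predictions
    on the target domain without using any target labels.
    """
    return sum(prediction_entropy(row) for row in batch_logits) / len(batch_logits)


# Hypothetical image-text similarity logits for a 3-class problem.
confident_batch = [[8.0, 0.5, 0.2], [0.1, 7.5, 0.3]]
uncertain_batch = [[1.0, 1.0, 1.0], [0.9, 1.1, 1.0]]

loss_confident = target_entropy_loss(confident_batch)
loss_uncertain = target_entropy_loss(uncertain_batch)
```

In this sketch the confident batch yields a lower loss than the uncertain one, which is the property an entropy-based objective exploits: reducing it on target data encourages decision boundaries that separate target samples cleanly. In practice such a term would be combined with the supervised loss on the source domain.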