Abstract: In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source. The challenge lies in aligning the shared classes of the two domains while separating the shared classes from the unknown ones. We propose to address OUVDA with a unified contrastive learning framework that learns discriminative and well-clustered features. We also propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data. We show that a discriminative feature space facilitates better separation of the unknown classes, and thereby allows us to use a simple similarity-based score to identify them. We conduct a thorough experimental evaluation on multiple OUVDA benchmarks and show the effectiveness of our proposed method against the prior art.

What problem does this paper attempt to address?

This paper addresses Open-Set Video Domain Adaptation (OUVDA). Specifically, it focuses on how to transfer a pre-trained model from a labeled dataset (source domain) to an unlabeled dataset (target domain) when the target dataset contains unknown classes (i.e., target-private or out-of-distribution (OOD) classes), while correctly distinguishing shared classes from unknown ones.

### Problem Description

1. **Domain Adaptation Problem**: Traditional unsupervised video domain adaptation methods assume that the source and target domains have the same label space (closed-set scenario). In the real world, however, the target domain may contain new classes that do not exist in the source domain (open-set scenario), so existing closed-set methods no longer apply.
2. **Open-Set Challenges**: In the open-set scenario, the target domain contains classes shared with the source domain and may also contain unknown classes. If unknown-class samples are misclassified as shared classes, model performance degrades. Correctly identifying and excluding these samples during transfer is therefore a key issue.

### Solutions Proposed in the Paper

The paper proposes a contrastive-learning-based framework named COLOSEO (COntrastive Learning for Open-SET Video Domain Adaptation), which addresses the above problems in the following ways:

1. **Contrastive Learning Framework**: Learn discriminative feature representations through contrastive learning, so that the shared classes of the source and target domains are well aligned while the unknown classes are separated from them.
2. **Temporal Contrastive Loss**: Introduce a video-oriented temporal contrastive loss that exploits the temporal information inherent in video data to further improve the model's ability to distinguish action classes.
3. **Pseudo-Label and Unknown-Class Detection**: Detect unknown-class samples in the target domain automatically by computing the similarity between target samples and source-domain class prototypes, and exclude them from cross-domain feature alignment.
4. **Multi-Task Learning**: The final model both classifies the shared classes and assigns unknown-class samples to an additional class (the (K + 1)-th class).

### Formula Summary

- **Label-based Contrastive Loss**:
  \[ L_{\text{sup}}^i = -\log\frac{\sum_{j=1}^{2b}\mathbf{1}(y_S^i = y_S^j)\exp\left(\frac{\text{sim}(\bar{z}_S^i,\bar{z}_S^j)}{\tau}\right)}{\sum_{k=1}^{2b}\mathbf{1}(k\neq i)\,\mathbf{1}(y_S^k\neq y_S^i)\exp\left(\frac{\text{sim}(\bar{z}_S^i,\bar{z}_S^k)}{\tau}\right)} \]
- **Augmentation-based Contrastive Loss**:
  \[ L_{\text{aug}}^i = -\log\frac{\exp\left(\frac{\text{sim}(z_T^i,\tilde{z}_T^i)}{\tau}\right)}{\sum_{k=1}^{2b}\mathbf{1}(k\neq i)\exp\left(\frac{\text{sim}(z_T^i,z_T^k)}{\tau}\right)} \]
- **Temporal Contrastive Loss**:
  \[ L_{\text{temp}}^i = \max\{d(h_i,\tilde{h}_i) - d(h_i,h_i^{-}) + \alpha,\ 0\} \]
- **Cross-domain Contrastive Loss**
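
The prototype-similarity step used for unknown-class detection can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a class prototype is the mean source feature of that class, that similarity is cosine similarity, and that a fixed threshold decides when a target sample is rejected. The names `class_prototypes`, `pseudo_label`, `UNKNOWN` and the threshold value are hypothetical.

```python
import math

UNKNOWN = -1  # hypothetical label for rejected (unknown-class) target samples


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_prototypes(features, labels):
    """Assumption: a prototype is the mean source feature of each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def pseudo_label(target_feature, prototypes, threshold=0.5):
    """Return the most similar shared class, or UNKNOWN if similarity is low."""
    best_class, best_sim = max(
        ((y, cosine(target_feature, p)) for y, p in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return best_class if best_sim >= threshold else UNKNOWN


# Toy 2-D source features for two shared classes.
source_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
source_labels = [0, 0, 1, 1]
prototypes = class_prototypes(source_features, source_labels)
```

In this sketch, target samples labelled `UNKNOWN` would be left out of cross-domain alignment, while the remaining samples keep the pseudo-label of their nearest prototype.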
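
The augmentation-based contrastive loss listed above can be written as a short function for a single anchor. The sketch assumes that `sim` is cosine similarity and that the batch of 2b embeddings is a plain list in which `pos` indexes the augmented view of sample `i`. The function name and the temperature value are illustrative, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity, standing in for sim(., .) in the formula."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def augmentation_contrastive_loss(i, pos, embeddings, tau=0.1):
    """L_aug for anchor i: -log(exp(sim(z_i, z_pos)/tau) / sum_{k != i} exp(sim(z_i, z_k)/tau))."""
    anchor = embeddings[i]
    numerator = math.exp(cosine(anchor, embeddings[pos]) / tau)
    denominator = sum(
        math.exp(cosine(anchor, embeddings[k]) / tau)
        for k in range(len(embeddings))
        if k != i  # the indicator 1(k != i) excludes the anchor itself
    )
    return -math.log(numerator / denominator)


# Toy batch: indices 0 and 1 are two views of one clip, 2 and 3 of another.
batch = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_matched = augmentation_contrastive_loss(0, 1, batch)
loss_mismatched = augmentation_contrastive_loss(0, 2, batch)
```

The loss is small when the anchor is closer to its own augmented view than to the other samples in the batch, and grows when the designated positive is dissimilar.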
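
The temporal contrastive loss above has the form of a margin-based triplet loss, which the following sketch evaluates on toy vectors. The summary does not say how `d` or the negative `h_i^{-}` are defined, so the sketch assumes Euclidean distance and takes the negative embedding as a given input. The function names and the margin value are hypothetical.

```python
import math


def euclidean(u, v):
    """Assumed choice for the distance d(., .) in the formula."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def temporal_triplet_loss(anchor, positive, negative, margin=1.0):
    """max{d(h, h_pos) - d(h, h_neg) + margin, 0} for one sample."""
    return max(
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
        0.0,
    )


# Positive is closer to the anchor than the negative by more than the margin.
satisfied = temporal_triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=1.0)

# Positive is farther than the negative, so the constraint is violated.
violated = temporal_triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.5)
```

The loss is zero once the positive sits closer to the anchor than the negative by at least the margin, so only triplets that violate the margin contribute a gradient.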

Simplifying Open-Set Video Domain Adaptation with Contrastive Learning

Video domain adaptation for semantic segmentation using perceptual consistency matching

AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation

Spatio-temporal Contrastive Domain Adaptation for Action Recognition

Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey

Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning

Cross-domain video action recognition via adaptive gradual learning

Spatio-Temporal Pixel-Level Contrastive Learning-based Source-Free Domain Adaptation for Video Semantic Segmentation

Object-based (yet Class-agnostic) Video Domain Adaptation

Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective

Temporal Attentive Alignment for Large-Scale Video Domain Adaptation

GLAD: Global-Local View Alignment and Background Debiasing for Unsupervised Video Domain Adaptation with Large Domain Gap

Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding

DAVOS: Semi-Supervised Video Object Segmentation via Adversarial Domain Adaptation

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

Heterogeneous Domain Adaptation Method for Video Annotation

Imbalanced Open Set Domain Adaptation via Moving-threshold Estimation and Gradual Alignment

Leveraging Endo- and Exo-Temporal Regularization for Black-box Video Domain Adaptation

Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation

Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition