Giacomo Zara, Victor Guilherme Turrisi da Costa, Subhankar Roy, Paolo Rota, Elisa Ricci
Abstract: In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source. The challenge lies in aligning the shared classes of the two domains while separating the shared classes from the unknown ones. We propose to address OUVDA with a unified contrastive learning framework that learns discriminative and well-clustered features. We also propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data. We show that a discriminative feature space facilitates better separation of the unknown classes, and thereby allows us to use a simple similarity-based score to identify them. We conduct a thorough experimental evaluation on multiple OUVDA benchmarks and show the effectiveness of our proposed method against the prior art.
What problem does this paper attempt to address?
This paper addresses open-set video domain adaptation (OUVDA): adapting a model trained on a labeled dataset (source domain) to an unlabeled dataset (target domain) when the target dataset also contains unknown classes (i.e., target-private or out-of-distribution classes). The adapted model must classify the shared classes correctly and, at the same time, tell them apart from the unknown classes.
### Problem Description
1. **Domain Adaptation Problem**: Traditional unsupervised video domain adaptation methods assume that the source and target domains share the same label space (the closed-set scenario). In practice, the target domain may contain classes that do not exist in the source domain (the open-set scenario), so closed-set methods no longer apply directly.
2. **Open-Set Challenges**: In the open-set scenario, the target domain contains both classes shared with the source domain and unknown classes. If unknown-class samples are misclassified as shared classes, they are pulled into the cross-domain alignment and model performance degrades. Correctly identifying and excluding these samples during adaptation is therefore the key issue.
### Solutions Proposed in the Paper
The paper proposes a contrastive-learning-based framework named COLOSEO (COntrastive Learning for Open-SET Video Domain Adaptation) that addresses these problems in the following ways:
1. **Contrastive Learning Framework**: Contrastive learning produces discriminative feature representations, so that the shared classes of the source and target domains are well aligned while the unknown classes are separated from them.
2. **Temporal Contrastive Loss**: A video-oriented temporal contrastive loss exploits the temporal information inherent in video data to further improve the model's ability to distinguish action classes.
3. **Pseudo-Labeling and Unknown-Class Detection**: Unknown-class samples in the target domain are detected automatically by computing the similarity between target samples and source-domain class prototypes, and are excluded from cross-domain feature alignment.
4. **Multi-Task Learning**: The final model not only classifies the K shared classes but also assigns unknown-class samples to an additional (K+1)-th class.
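
The prototype-based detection in step 3 can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the use of cosine similarity, the mean-feature prototypes, and the fixed `threshold` are all assumptions made for illustration, since the summary only states that a similarity score against source class prototypes is used.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def class_prototypes(source_feats, source_labels, num_classes):
    """One prototype per shared class: the mean of its normalised source features."""
    feats = l2_normalize(source_feats)
    protos = np.stack(
        [feats[source_labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return l2_normalize(protos)

def pseudo_label_targets(target_feats, prototypes, threshold=0.5):
    """Nearest-prototype pseudo-labels; low-similarity samples get label K (unknown).

    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    """
    sims = l2_normalize(target_feats) @ prototypes.T   # shape (n_target, K)
    best_class = sims.argmax(axis=1)
    best_score = sims.max(axis=1)
    unknown_label = prototypes.shape[0]                # index K, i.e. the (K+1)-th class
    labels = np.where(best_score >= threshold, best_class, unknown_label)
    return labels, best_score

# Toy example with K = 2 shared classes in a 2-D feature space.
source_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
source_labels = np.array([0, 0, 1, 1])
protos = class_prototypes(source_feats, source_labels, num_classes=2)

target_feats = np.array([[1.0, 0.05], [0.05, 1.0], [-1.0, -1.0]])
labels, scores = pseudo_label_targets(target_feats, protos, threshold=0.5)
print(labels)  # [0 1 2]: the third sample is far from both prototypes
```

In this sketch, samples labelled `K` would be the ones left out of cross-domain alignment and later assigned to the extra class.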
### Formula Summary
- **Label-based Contrastive Loss**:
\[
L_{\text{sup}}^i = -\log\frac{\sum_{j = 1}^{2b}1(y_S^i = y_S^j)\exp(\frac{\text{sim}(\bar{z}_S^i,\bar{z}_S^j)}{\tau})}{\sum_{k = 1}^{2b}1(k\neq i)1(y_S^k\neq y_S^i)\exp(\frac{\text{sim}(\bar{z}_S^i,\bar{z}_S^k)}{\tau})}
\]
- **Augmentation-based Contrastive Loss**:
\[
L_{\text{aug}}^i = -\log\frac{\exp(\frac{\text{sim}(z_T^i,\tilde{z}_T^i)}{\tau})}{\sum_{k = 1}^{2b}1(k\neq i)\exp(\frac{\text{sim}(z_T^i,z_T^k)}{\tau})}
\]
- **Temporal Contrastive Loss**:
\[
L_{\text{temp}}^i=\max\{d(h_i,\tilde{h}_i)-d(h_i,h_i^{-})+\alpha,0\}
\]
- **Cross-domain Contrastive Loss**
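
As a rough illustration, the three losses whose formulas are given in this summary (label-based, augmentation-based, and temporal) can be sketched in NumPy as below. The cross-domain loss is not sketched because its formula is not given here. Several choices are assumptions rather than details from the paper: `sim` is taken to be cosine similarity, `d` to be Euclidean distance, and the values of `tau` and `alpha` are placeholders. The arguments `h_pos` and `h_neg` stand for the positive and negative temporal features written with a tilde and a minus superscript in the temporal formula; how they are constructed is not described in this summary.

```python
import numpy as np

def cosine_sim_matrix(a, b, eps=1e-12):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a @ b.T

def label_contrastive_loss(z, y, tau=0.1, eps=1e-12):
    """Per-sample label-based loss, following the formula as written above.

    Numerator: samples with the same label as i (the indicator also holds for j = i).
    Denominator: samples with a different label (k != i then holds automatically).
    """
    e = np.exp(cosine_sim_matrix(z, z) / tau)
    same = y[:, None] == y[None, :]
    num = (e * same).sum(axis=1)
    den = (e * ~same).sum(axis=1)
    return -np.log(num / np.maximum(den, eps))

def augmentation_contrastive_loss(z_a, z_b, tau=0.1):
    """Per-sample augmentation-based loss over the 2b views of b target clips.

    Row i of z_a and row i of z_b are two augmented views of the same clip.
    """
    z = np.concatenate([z_a, z_b], axis=0)          # shape (2b, d)
    n = z.shape[0]
    e = np.exp(cosine_sim_matrix(z, z) / tau)
    np.fill_diagonal(e, 0.0)                        # implements the k != i indicator
    pos_idx = (np.arange(n) + n // 2) % n           # index of the other view
    pos = e[np.arange(n), pos_idx]
    return -np.log(pos / e.sum(axis=1))

def temporal_triplet_loss(h, h_pos, h_neg, alpha=0.5):
    """Per-sample hinge loss: the positive should be closer than the negative by alpha."""
    d_pos = np.linalg.norm(h - h_pos, axis=1)       # Euclidean distance (assumed)
    d_neg = np.linalg.norm(h - h_neg, axis=1)
    return np.maximum(d_pos - d_neg + alpha, 0.0)

# Toy batch: two tight clusters that match the labels.
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
y_clustered = np.array([0, 0, 1, 1])
y_mixed = np.array([0, 1, 0, 1])
print(label_contrastive_loss(z, y_clustered).mean()
      < label_contrastive_loss(z, y_mixed).mean())  # True
```

Because the denominator of the label-based loss, as written, sums only over different-label samples, its value can be negative; in this sketch only relative comparisons between batches are meaningful.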