Abstract: In this work, we tackle the challenging problem of unsupervised video domain adaptation (UVDA) for action recognition. We specifically focus on scenarios with a substantial domain gap, in contrast to existing works that primarily deal with small domain gaps between labeled source domains and unlabeled target domains. To establish a more realistic setting, we introduce a novel UVDA scenario, denoted as Kinetics->BABEL, with a more considerable domain gap in terms of both temporal dynamics and background shifts. To tackle the temporal shift, i.e., the action duration difference between the source and target domains, we propose a global-local view alignment approach. To mitigate the background shift, we propose to learn temporal order sensitive representations by temporal order learning and background invariant representations by background augmentation. We empirically validate that the proposed method shows significant improvement over the existing methods on the Kinetics->BABEL dataset with a large domain gap. The code is available at <a class="link-external link-https" href="https://github.com/KHUVLL/GLAD" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

The paper addresses Unsupervised Video Domain Adaptation (UVDA) for video action recognition when there is a large domain gap between the source domain and the target domain. Specifically, it asks how labeled source-domain data can be used to improve a model's performance on an unlabeled target domain, without any target-domain labels. To study this setting, the paper introduces a new benchmark, Kinetics → BABEL, which has significant domain gaps in both temporal dynamics and background distribution and therefore provides a more realistic and challenging scenario for UVDA.

### Specific problems solved by the paper

1. **Domain gap**: Existing UVDA methods mainly deal with small domain gaps between the source domain and the target domain. In practical applications, such as moving from real-world to synthetic video or from day to night, the gap is often larger and calls for stronger adaptation methods.
2. **Temporal dynamics difference**: In Kinetics → BABEL, videos in the source domain (Kinetics) are usually long, while videos in the target domain (BABEL) are short. This difference in action duration makes the direct application of existing UVDA methods ineffective.
3. **Background distribution difference**: Kinetics videos have diverse backgrounds, while BABEL videos have consistent, simple backgrounds (such as a grayscale checkerboard). A model trained on the source domain may therefore rely on background cues rather than on the action itself.

### Proposed methods

To solve the above problems, the paper proposes two key techniques:

1. **Global-Local View Alignment (GLA)**:
   - **Global and local views**: Extract global and local feature vectors through different sampling strategies to capture information at different time scales.
   - **Domain alignment**: Use multiple domain classifiers (global-global, local-local, and cross-scale) to align the feature vectors of the source domain and the target domain. Adversarial training through a Gradient Reversal Layer (GRL) pushes the model toward domain-invariant representations.
2. **Background Debiasing**:
   - **Background augmentation**: Generate augmented videos by mixing in different backgrounds, encouraging the model to learn background-invariant representations.
   - **Temporal order learning**: Further regularize training by predicting the temporal order of video segments, which reduces the dependence on static backgrounds.

### Experimental verification

The paper evaluates the proposed method on the Kinetics → BABEL dataset. The results show that GLAD improves significantly over existing methods when the domain gap is large, in particular when both the backgrounds and the temporal dynamics differ between domains.

### Main contributions

1. **Introduced the Kinetics → BABEL dataset**: The dataset has significant domain gaps in both temporal dynamics and background distribution, providing a more challenging benchmark for UVDA research.
2. **Proposed the GLAD method**: Global-local view alignment and background debiasing address the differences in temporal dynamics and background distribution, respectively.
3. **Experimentally validated the method**: Results on the Kinetics → BABEL dataset show that GLAD outperforms existing methods.

Through these contributions, the paper provides new directions and tools for research on unsupervised video domain adaptation.
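To make the "global and local views" idea more concrete, here is a minimal sketch of two frame-sampling strategies operating at different time scales. The summary does not specify how GLAD samples its views, so the function names (`sample_global_view`, `sample_local_view`) and the exact sampling rules are illustrative assumptions, not the authors' implementation.

```python
import random


def sample_global_view(num_frames, clip_len):
    """Pick clip_len frame indices spread evenly over the whole video.

    Assumption: a "global" view covers the full duration at a coarse rate.
    """
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    # Take the centre of each of clip_len equal-length segments.
    # For very short videos, indices simply repeat.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / clip_len))
        for i in range(clip_len)
    ]


def sample_local_view(num_frames, clip_len, rng=random):
    """Pick clip_len consecutive frame indices from a random short window.

    Assumption: a "local" view covers a brief span at a fine rate.
    """
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    start = rng.randrange(max(1, num_frames - clip_len + 1))
    # Clamp so that videos shorter than clip_len repeat their last frame.
    return [min(num_frames - 1, start + i) for i in range(clip_len)]


if __name__ == "__main__":
    print("global:", sample_global_view(100, 4))
    print("local: ", sample_local_view(100, 4, random.Random(0)))
```

Under this sketch, a long source video and a short target video can both be described by a coarse view and a fine view, which is what makes global-global, local-local, and cross-scale comparisons possible.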
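The two background-debiasing components can be sketched in a similar way. The summary says only that backgrounds are "mixed" and that the temporal order of segments is predicted; the linear blending rule, the permutation-index label, and the names `mix_background` and `make_order_task` below are assumptions chosen for illustration and may differ from what GLAD actually does.

```python
import itertools
import random


def mix_background(frames, background, alpha):
    """Blend every frame with one background image.

    frames:     list of 2D grayscale images (lists of rows of floats)
    background: one 2D image of the same size
    alpha:      weight kept for the original frame, in [0, 1]
    Assumed rule: alpha * frame + (1 - alpha) * background.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [
            [alpha * f + (1.0 - alpha) * b for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(frame, background)
        ]
        for frame in frames
    ]


def make_order_task(clips, rng=random):
    """Shuffle clips and return (shuffled_clips, label).

    The label is the index of the applied permutation, so a classifier
    can only solve the task by attending to motion, not to a static
    background that looks the same in every clip.
    """
    perms = list(itertools.permutations(range(len(clips))))
    label = rng.randrange(len(perms))
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label


if __name__ == "__main__":
    video = [[[1.0, 0.0]]]   # one frame, one row, two pixels
    checker = [[0.0, 1.0]]   # stand-in for a different background
    print(mix_background(video, checker, alpha=0.75))
    print(make_order_task(["clip_a", "clip_b", "clip_c"], random.Random(0)))
```

Both pieces act only on the training inputs and labels, so under these assumptions they could be added to an existing training loop without changing the backbone.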

GLAD: Global-Local View Alignment and Background Debiasing for Unsupervised Video Domain Adaptation with Large Domain Gap
