Adversarially Masked Video Consistency for Unsupervised Domain Adaptation

Xiaoyu Zhu, Junwei Liang, Po-Yao Huang, Alex Hauptmann
2024-03-25
Abstract: We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is called Generative Adversarial Domain Alignment Network with the aim of learning domain-invariant representations. It simultaneously learns a mask generator and a domain-invariant encoder in an adversarial way. The domain-invariant encoder is trained to minimize the distance between the source and target domain. The masking generator, conversely, aims at producing challenging masks by maximizing the domain distance. The second is a Masked Consistency Learning module to learn class-discriminative representations. It enforces the prediction consistency between the masked target videos and their full forms. To better evaluate the effectiveness of domain adaptation methods, we construct a more challenging benchmark for egocentric videos, U-Ego4D. Our method achieves state-of-the-art performance on the Epic-Kitchen and the proposed U-Ego4D benchmark.
Computer Science
What problem does this paper attempt to address?
The problem this paper addresses is **Unsupervised Domain Adaptation (UDA) for egocentric videos**, particularly for action recognition across different environments and scenarios. Specifically, the paper focuses on how to transfer a model trained on a labeled source domain to a target domain that has no labels, while improving its performance on that target domain.

### Background and Challenges of the Problem

1. **Domain Gap**: Visual differences between the source and target domains (background, lighting conditions, viewing angles, interaction objects, motion patterns, and so on) cause performance to degrade when a model is transferred directly. Existing methods usually align the feature distributions of the two domains through adversarial learning, but this can lead the model to rely only on simple cues (such as lighting differences) and ignore other important factors.
2. **Lack of Fine-Grained Data Annotation**: In practice, collecting large numbers of egocentric videos with fine-grained annotations is expensive and often infeasible. UDA is therefore an attractive solution, since it transfers a model without using target-domain labels.
3. **Limitations of Existing Benchmarks**: Existing UDA benchmarks mainly focus on a single environment (such as the kitchen), and the domain differences are small (for example, different kitchens are treated as different domains). Models evaluated this way are not tested against more complex and diverse real-world scenarios.

### Solutions in the Paper

To address these problems, the authors propose the following:

1. **A New Benchmark, U-Ego4D**: Based on the large-scale Ego4D dataset, the authors construct a more challenging UDA benchmark, U-Ego4D. It covers a variety of daily-life scenarios (home, outdoors, workplaces, and others), and the same action can occur in different environments, which increases the complexity and diversity of the task.
2. **Generative Adversarial Domain Alignment Network (GADAN)**: GADAN consists of an adversarial mask generator and a domain-invariant encoder. The mask generator produces challenging masks that maximize the distance between the source and target domains, while the encoder is trained on these masked samples to minimize that distance. This adversarial mechanism keeps the model from settling for alignment on simple features and pushes it to learn more robust representations.
3. **Masked Consistency Learning (MCL) Module**: MCL strengthens the model's understanding of spatio-temporal context and improves class discrimination by forcing predictions on the masked view to stay consistent with predictions on the complete view. Specifically, MCL uses pseudo-labels to guide learning, so that unlabeled target-domain videos can still be classified effectively.

### Summary

The main contributions of this paper are:

- Proposing a new UDA benchmark, U-Ego4D, to evaluate video domain adaptation models in more complex and diverse scenarios.
- Introducing a Transformer-based model that combines the GADAN and MCL modules to learn domain-invariant and class-discriminative representations.
- Achieving state-of-the-art performance on the Epic-Kitchens and U-Ego4D benchmarks, demonstrating the effectiveness of the proposed method.

Through these contributions, the paper provides a new framework for handling the domain gap in UDA and a useful reference for future research.
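
To make the masked consistency idea concrete, here is a minimal sketch of one way such a loss could be computed for a single target video. It is not the paper's implementation: the function names, the cross-entropy form of the loss, and the confidence threshold of 0.8 are all assumptions made for illustration. The pseudo-label is taken from the prediction on the full video, and the prediction on the masked video is penalized when it disagrees.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_consistency_loss(full_logits, masked_logits, threshold=0.8):
    """Hypothetical consistency loss for one unlabeled target video.

    full_logits:   class scores predicted from the complete video
    masked_logits: class scores predicted from the masked video
    threshold:     assumed confidence cutoff; low-confidence pseudo-labels
                   are skipped (loss 0.0) so they do not guide training
    """
    full_probs = softmax(full_logits)
    confidence = max(full_probs)
    pseudo_label = full_probs.index(confidence)

    if confidence < threshold:
        return 0.0

    # Cross-entropy of the masked-view prediction against the pseudo-label.
    masked_probs = softmax(masked_logits)
    return -math.log(masked_probs[pseudo_label])


if __name__ == "__main__":
    full = [4.0, 0.0, 0.0]        # confident prediction of class 0
    agrees = [4.0, 0.0, 0.0]      # masked view still predicts class 0
    disagrees = [0.0, 4.0, 0.0]   # masked view flips to class 1
    print(masked_consistency_loss(full, agrees))
    print(masked_consistency_loss(full, disagrees))
```

In this sketch the loss is small when the masked view agrees with the full view and large when it does not, which is the behavior the summary attributes to MCL. In a real model the logits would come from the Transformer encoder, and the masks would come from the adversarial mask generator rather than being fixed by hand.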