Mix-DANN and Dynamic-Modal-Distillation for Video Domain Adaptation

Yuehao Yin, Bin Zhu, Jingjing Chen, Lechao Cheng, Yu-Gang Jiang
DOI: https://doi.org/10.1145/3503161.3548313
2022-01-01
Abstract: Video domain adaptation is non-trivial because video inherently involves multi-dimensional and multi-modal information. Existing works mainly adopt adversarial learning and self-supervised tasks to align features. Nevertheless, the explicit interaction between source and target in the temporal dimension, as well as the adaptation between modalities, remains unexploited. In this paper, we propose Mix-Domain-Adversarial Neural Network and Dynamic-Modal-Distillation (MD-DMD), a novel multi-modal adversarial learning framework for unsupervised video domain adaptation. Our approach incorporates the temporal information between source and target domains, as well as the differences in adaptability between modalities. On the one hand, for every single modality, we mix frames from the source and target domains to form mix-samples, then let the adversarial discriminator predict the mix ratio of each mix-sample, which further enhances the model's ability to capture domain-invariant feature representations. On the other hand, we dynamically estimate the adaptability of the different modalities during training, then pick the most adaptable modality as a teacher to guide the other modalities through knowledge distillation. As a result, the modalities can learn transferable knowledge from each other, which leads to more effective adaptation. Experiments on two video domain adaptation benchmarks demonstrate the superiority of the proposed MD-DMD over state-of-the-art methods.
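The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how frames are chosen for mixing or how adaptability is measured, so the random per-position selection, the function names mix_frames and pick_teacher, and the adaptability scores below are all assumptions made for illustration.

```python
import random


def mix_frames(source_frames, target_frames, ratio, rng):
    """Form a mix-sample from two equally long frame sequences.

    ASSUMPTION: each temporal position is filled from either the source
    or the target clip, and positions are chosen uniformly at random.
    Returns the mixed sequence and the realized source fraction, which
    would serve as the regression target for a domain discriminator.
    """
    num_frames = len(source_frames)
    num_source = round(ratio * num_frames)
    # Temporal positions that take their frame from the source clip.
    source_positions = set(rng.sample(range(num_frames), num_source))
    mixed = [
        source_frames[t] if t in source_positions else target_frames[t]
        for t in range(num_frames)
    ]
    return mixed, num_source / num_frames


def pick_teacher(adaptability):
    """Return the modality with the highest adaptability score.

    ASSUMPTION: `adaptability` maps a modality name to a scalar where
    larger means more adaptable; how to compute it is not shown here.
    """
    return max(adaptability, key=adaptability.get)


# Hypothetical usage with placeholder frames and made-up scores.
rng = random.Random(0)
source_clip = ["src_frame"] * 8
target_clip = ["tgt_frame"] * 8
mixed_clip, mix_ratio = mix_frames(source_clip, target_clip, 0.25, rng)
teacher = pick_teacher({"rgb": 0.4, "flow": 0.7})
```

In a real pipeline the frames would be feature tensors for one modality, the discriminator would be trained to regress mix_ratio, and the teacher would be re-selected as the adaptability estimates change during training.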