Abstract: Action recognition has received increasing attention from the computer vision and machine learning communities over the last decade. Although many action recognition algorithms have been proposed, they often require similar environmental conditions in the training and testing stages, which limits their applicability. To improve the generalization of action recognition, this paper explores the cross-domain action recognition problem from three aspects: 1) Feature learning: hand-crafted features and deep learning features are extracted, and their generalization ability is assessed and discussed in both controlled and uncontrolled environments. 2) Unsupervised cross-domain learning: because labeled samples in the target domain are difficult to obtain, unsupervised cross-domain learning methods can be borrowed. To determine which one suits the open-domain action recognition problem, three kinds of unsupervised cross-domain learning methods are assessed on an open-domain action recognition dataset. 3) Supervised cross-domain learning: when some labeled samples are available in the target domain but their number is very limited, supervised cross-domain learning should be a good choice, but which method should be chosen? To answer this, these methods are also evaluated on the same dataset. Moreover, we contribute a novel multi-view and multi-modality human action recognition dataset (abbreviated as "MMA"). It consists of 7,080 action samples from 25 action categories, including 15 single-subject actions and 10 double-subject interactive actions, in three views of two different scenarios, and it can be used to simultaneously explore single-view learning, multi-view learning, multi-modality learning, and cross-domain learning problems. We further explore the same learning problems on the MMA dataset.
Extensive experimental results on the two datasets show that deep feature learning generalizes much better than hand-crafted features such as improved dense trajectories, provided there are enough labeled training samples to fine-tune the network. Both unsupervised and supervised cross-domain learning methods improve performance, but the latter yields a much larger improvement; in other words, labeled samples in the target domain are very helpful. Finally, we also participated in the open domain action recognition challenge held at the CVPR 2017 workshop, where our supervised cross-domain learning scheme achieved the best performance among all teams.

Exploring the Cross-Domain Action Recognition Problem by Deep Feature Learning and Cross-Domain Learning