Abstract: Action recognition has received increasing attention from the computer vision and machine learning communities over the last decade. Although many action recognition algorithms have been proposed, they often require similar environmental conditions in the training and testing stages, which limits their application. To improve the generalization of action recognition, this paper explores the cross-domain action recognition problem from three aspects: 1) feature learning: hand-crafted features and deep learning features are extracted, and their generalization ability is assessed and discussed in controlled and uncontrolled environments; 2) unsupervised cross-domain learning: because labeled samples are difficult to obtain in the target domain, unsupervised cross-domain learning methods can be borrowed, and to determine which is suitable for the open domain action recognition problem, three kinds of unsupervised cross-domain learning methods are assessed on an open domain action recognition dataset; 3) supervised cross-domain learning: when some labeled samples are available in the target domain but their number is very limited, supervised cross-domain learning should be a good choice, yet it is unclear how to choose among such methods, so these methods are also evaluated on the same dataset. Moreover, we contribute a novel multi-view and multi-modality human action recognition dataset (abbreviated as "MMA"). It consists of 7,080 action samples from 25 action categories, including 15 single-subject actions and 10 double-subject interactive actions, in three views of two different scenarios, and it can be used to simultaneously explore single-view learning, multi-view learning, multi-modality learning, and cross-domain learning problems. We further explore the same learning problems on the MMA dataset.
The extensive experimental results on two different datasets show that the deep feature learning method generalizes much better than hand-crafted features such as improved dense trajectories, provided that there are enough labeled samples in the training dataset to fine-tune the network. Both unsupervised and supervised cross-domain learning methods improve performance, but the latter yields a much larger improvement; in other words, labeled samples in the target domain are very helpful. Finally, we also participated in the open domain action recognition challenge held at the CVPR 2017 workshop, where our supervised cross-domain learning scheme obtained the best performance among all teams.
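To make the idea of unsupervised cross-domain learning concrete, the following is a minimal sketch of one very simple label-free alignment baseline: standardizing each feature dimension separately within each domain, so that source and target features share the same first and second moments per dimension. The abstract does not name the three unsupervised methods that were assessed, so this technique is a stand-in chosen for illustration and is not claimed to be one of them. The function name `standardize_domain` and the toy feature values are hypothetical.

```python
from statistics import mean, pstdev


def standardize_domain(features):
    """Shift and scale each feature dimension to zero mean and unit variance.

    `features` is a list of equal-length lists (one row per action sample).
    No labels are used, so this can be applied to an unlabeled target domain.
    """
    columns = list(zip(*features))
    means = [mean(col) for col in columns]
    # Guard against constant dimensions, whose standard deviation is zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in features
    ]


# Hypothetical toy features: the target domain is offset and rescaled
# relative to the source domain (an assumed, simplified domain shift).
source = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
target = [[101.0, 5.0], [102.0, 10.0], [103.0, 15.0]]

# Each domain is normalized with its own statistics, without any labels.
source_aligned = standardize_domain(source)
target_aligned = standardize_domain(target)
```

In this toy case the shift between domains is a per-dimension offset and scale, so the two aligned feature sets coincide and a classifier trained on `source_aligned` could be applied to `target_aligned` directly. Real domain shifts between controlled and uncontrolled environments are more complex, which is why stronger cross-domain learning methods are studied.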

Multi-Domain and Multi-Task Learning for Human Action Recognition

Exploring the Cross-Domain Action Recognition Problem by Deep Feature Learning and Cross-Domain Learning

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Cross-modality Online Distillation for Multi-View Action Recognition

MS²L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

Discriminative Multi-View Subspace Feature Learning for Action Recognition

Human action recognition via multi-view learning

Continuous Multi-View Human Action Recognition

Discriminative Deep Multi-Task Learning for Facial Expression Recognition

PTL-LTM model for complex action recognition using local-weighted NMF and deep dual-manifold regularized NMF with sparsity constraint

Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition

Multilayer deep features with multiple kernel learning for action recognition

Multi-view key information representation and multi-modal fusion for single-subject routine action recognition

Multi-layer Representation for Cross-view Action Recognition

Representation modeling learning with multi-domain decoupling for unsupervised skeleton-based action recognition

Human-Centered Prior-Guided and Task-Dependent Multi-Task Representation Learning for Action Recognition Pre-Training

HirMTL: Hierarchical Multi-Task Learning for dense scene understanding

Multi-scale motion contrastive learning for self-supervised skeleton-based action recognition