Unsupervised Learning of View-invariant Action Representations

Junnan Li, Yongkang Wong, Qi Zhao, Mohan S. Kankanhalli

DOI: https://doi.org/10.48550/arXiv.1809.01844

2018-09-06

Abstract: Recent successes in human action recognition with deep learning methods mostly adopt the supervised learning paradigm, which requires a significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework, which exploits unlabeled data to learn video representations. Different from previous works in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using the video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics that are discriminative for the action. In addition, we propose a view-adversarial training method to enhance learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to learn view-invariant action representations from unlabeled video, so that human action recognition depends less on large amounts of manually labeled data. Most existing deep learning methods for action recognition follow the supervised learning paradigm and need a large labeled dataset to perform well, yet collecting labels is expensive and time-consuming. The authors therefore propose an unsupervised learning framework that learns video representations from unlabeled data. Unlike previous work on video representation learning, the unsupervised task here is to predict 3D motion in multiple target views from the video representation of a source view. By learning to extrapolate motion across views, the representation captures view-invariant motion dynamics that are discriminative for actions. The authors also propose a view-adversarial training method to strengthen the learning of view-invariant features. Finally, they evaluate the learned representations on action recognition across multiple datasets.
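The two ingredients named in the summary above, cross-view motion prediction and view-adversarial training, can be illustrated with a toy objective. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the two terms are combined, so the mean-squared-error motion loss, the softmax cross-entropy view classifier, the function names, and the `adv_weight` trade-off value are all hypothetical.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length flat lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample; `label` is the true view index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def encoder_objective(pred_flows, target_flows, view_logits, source_view,
                      adv_weight=0.1):
    """Hypothetical combined objective for the encoder (an assumption).

    pred_flows / target_flows: dict of target-view id -> flattened motion values
    view_logits: view-classifier scores computed from the source-view features
    adv_weight: assumed trade-off weight, not a value taken from the paper
    """
    # Cross-view term: average motion-prediction error over all target views.
    flow_loss = sum(mse(pred_flows[v], target_flows[v])
                    for v in target_flows) / len(target_flows)
    # Adversarial term: the view classifier minimizes this loss, while the
    # encoder is rewarded when the source view is hard to identify.
    view_loss = cross_entropy(view_logits, source_view)
    return flow_loss - adv_weight * view_loss, flow_loss, view_loss

# Toy usage: two target views and a view classifier that is maximally unsure.
pred = {1: [0.0, 1.0], 2: [2.0, 2.0]}
target = {1: [0.0, 0.0], 2: [2.0, 4.0]}
total, flow_loss, view_loss = encoder_objective(
    pred, target, view_logits=[0.0, 0.0, 0.0], source_view=0, adv_weight=0.5)
print(flow_loss, view_loss, total)
```

In this sketch the encoder's objective decreases when target-view motion is predicted well and when the view classifier cannot tell which view the features came from, which is one simple way to express the idea of view-invariant features.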

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

Hierarchically Learned View-Invariant Representations for Cross-View Action Recognition

View-invariant feature discovering for multi-camera human action recognition

Unsupervised View-Invariant Human Posture Representation

Unsupervised Deep Learning of Mid-Level Video Representation for Action Recognition

Unsupervised learning using sequential verification for action recognition

View-invariant action recognition: a survey

View-Robust Neural Networks for Unseen Human Action Recognition in Videos

Unsupervised Learning of Human Action Categories in Still Images with Deep Representations

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Deeply Learned View-Invariant Features for Cross-View Action Recognition

View-Invariant Action Recognition Using Latent Kernelized Structural SVM

3D Human Action Representation Learning via Cross-View Consistency Pursuit

Continuous Multi-View Human Action Recognition

Unsupervised Representation Learning With Long-Term Dynamics for Skeleton Based Action Recognition

Cross-View Action Recognition Based on Hierarchical View-Shared Dictionary Learning

View-Invariant Skeleton Action Representation Learning via Motion Retargeting

Weakly-Supervised Multi-Person Action Recognition in 360$^{\circ}$ Videos

Action Recognition Using Spatial-Optical Data Organization and Sequential Learning Framework