Unsupervised Learning of View-invariant Action Representations

Junnan Li, Yongkang Wong, Qi Zhao, Mohan S. Kankanhalli
DOI: https://doi.org/10.48550/arXiv.1809.01844
2018-09-06
Abstract: The recent success in human action recognition with deep learning methods mostly adopts the supervised learning paradigm, which requires a significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework, which exploits unlabeled data to learn video representations. Different from previous works in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics which are discriminative for the action. In addition, we propose a view-adversarial training method to enhance learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to learn view-invariant action representations from unlabeled video, so that human action recognition depends less on large amounts of manually labeled data. Most existing deep-learning methods for action recognition follow the supervised-learning paradigm and need a large labeled dataset to perform well, but collecting labels is expensive and time-consuming. The authors therefore propose an unsupervised learning framework that learns video representations from unlabeled data. Unlike previous work on video representation learning, the unsupervised task here is to predict 3D motion in multiple target views from the video representation of a source view. By learning to extrapolate motion across views, the representation captures view-invariant motion dynamics that are discriminative for actions. The authors also propose a view-adversarial training method to strengthen the learning of view-invariant features. Finally, they demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.
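To make the two training signals described above concrete, here is a minimal sketch of how they might be combined into a single objective for the encoder. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of mean squared error for the motion term, the softmax cross-entropy view classifier, and the weight `adv_weight` are all hypothetical choices, and the paper's actual losses and architecture may differ.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length flat lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stabilised)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def encoder_objective(pred_motion, true_motion, view_logits, source_view,
                      adv_weight=0.1):
    """Hypothetical encoder objective for this sketch.

    pred_motion / true_motion: dicts mapping a target-view id to a flat list
        of motion values (standing in for 3D motion in that view).
    view_logits: scores from a hypothetical view classifier that tries to
        guess which view the source representation came from.
    adv_weight: assumed trade-off weight, not a value from the paper.
    """
    # Cross-view term: predict motion in every target view from one source view.
    motion_loss = sum(mse(pred_motion[v], true_motion[v]) for v in true_motion)
    # Adversarial term: the classifier minimises view_loss, while the encoder
    # is rewarded when the classifier fails, hence the minus sign.
    view_loss = cross_entropy(view_logits, source_view)
    total = motion_loss - adv_weight * view_loss
    return total, motion_loss, view_loss

# Toy example: two target views, three candidate source views.
pred = {1: [0.0, 1.0], 2: [1.0, 1.0]}
true = {1: [0.0, 0.0], 2: [1.0, 1.0]}
total, motion_loss, view_loss = encoder_objective(pred, true, [0.0, 0.0, 0.0], 0)
print(total, motion_loss, view_loss)
```

In this sketch the encoder's objective decreases when the view classifier is more confused, which is one common way to encourage features that do not reveal the source view. In a real training loop the same effect is often obtained with a gradient-reversal step between the encoder and the classifier; whether the paper does exactly this is not stated in the text above.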