Abstract:Data representation learning is one of the most important problems in machine learning. Unsupervised representation learning becomes meritorious as it has no necessity of label information with observed data. Due to the highly time-consuming learning of deep-learning models, there are many machine-learning models directly adapting well-trained deep models that are obtained in a supervised and end-to-end manner as feature abstractors to distinct problems. However, it is obvious that different machine-learning tasks require disparate representation of original input data. Taking human action recognition as an example, it is well known that human actions in a video sequence are 3-D signals containing both visual appearance and motion dynamics of humans and objects. Therefore, the data representation approaches with the capabilities to capture both spatial and temporal correlations in videos are meaningful. Most of the existing human motion recognition models build classifiers based on deep-learning structures such as deep convolutional networks. These models require a large quantity of training videos with annotations. Meanwhile, these supervised models cannot recognize samples from the distinct dataset without retraining. In this article, we propose a new 3-D deconvolutional network (3DDN) for representation learning of high-dimensional video data, in which the high-level features are obtained through the optimization approach. The proposed 3DDN decomposes the video frames into spatiotemporal features under a sparse constraint in an unsupervised way. In addition, it also can be regarded as a building block to develop deep architectures by stacking. The high-level representation of input sequential data can be used in multiple downstream machine-learning tasks, we evaluate the proposed 3DDN and its deep models in human action recognition. The experimental results from three datasets: 1) KTH data; 2) HMDB-51; and 3) UCF-101, demonstrate that the-proposed 3DDN is an alternative approach to feedforward convolutional neural networks (CNNs), that attains comparable results.

How Deep Neural Networks Understand Motion? Toward Interpretable Motion Modeling by Leveraging the Relative Change in Position

MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions

Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction

Disentangled Neural Relational Inference for Interpretable Motion Prediction

Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes Approach

Adaptive Hierarchical Motion-Focused Model for Video Prediction.

Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks

Continuous Motion Recognition In Depth Camera Based On Recurrent Neural Networks And Grid-Based Average Depth

On human motion prediction using recurrent neural networks

3-D Deconvolutional Networks for the Unsupervised Representation Learning of Human Motions

Deep Networks for Image Motion Estimation

Embodied learning of a generative neural model for biological motion perception and inference

Unsupervised Learning of Long-Term Motion Dynamics for Videos

MotionSqueeze: Neural Motion Feature Learning for Video Understanding

MoNet: Motion-Based Point Cloud Prediction Network

Learning to Represent Mechanics via Long-term Extrapolation and Interpolation

A multilayer human motion prediction perceptron by aggregating repetitive motion

MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps

Nonparametric motion model.

Multitask Non-Autoregressive Model For Human Motion Prediction

Spatio-temporal Manifold Learning for Human Motions via Long-horizon Modeling