On human motion prediction using recurrent neural networks

Julieta Martinez, Michael J. Black, Javier Romero
DOI: https://doi.org/10.48550/arXiv.1705.02445
2017-05-06
Abstract: Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion, with the goal of learning time-dependent representations that perform tasks such as short-term motion prediction and long-term human motion synthesis. We examine recent work, with a focus on the evaluation methodologies commonly used in the literature, and show that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not attempt to model motion at all. We investigate this result, and analyze recent RNN methods by looking at the architectures, loss functions, and training procedures used in state-of-the-art approaches. We propose three changes to the standard RNN models typically used for human motion, which result in a simple and scalable RNN architecture that obtains state-of-the-art performance on human motion prediction.
Computer Vision and Pattern Recognition
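
The abstract mentions a simple baseline that does not attempt to model motion at all. One natural reading is a zero-velocity predictor that repeats the last observed pose for every future frame. The sketch below illustrates that reading. The names `zero_velocity_baseline` and `mean_abs_error`, the flat joint-angle layout, and the error measure are all assumptions made for illustration, not details taken from the text above.

```python
from typing import List

# One frame of motion: a flat vector of joint angles (hypothetical layout).
Pose = List[float]


def zero_velocity_baseline(observed: List[Pose], horizon: int) -> List[Pose]:
    """Predict `horizon` future frames by repeating the last observed frame."""
    if not observed:
        raise ValueError("need at least one observed frame")
    last = observed[-1]
    # Copy the frame so callers can modify predictions independently.
    return [list(last) for _ in range(horizon)]


def mean_abs_error(pred: List[Pose], target: List[Pose]) -> float:
    """Mean absolute per-dimension error (an illustrative stand-in metric)."""
    total = sum(abs(p - t) for pf, tf in zip(pred, target) for p, t in zip(pf, tf))
    count = sum(len(pf) for pf in pred)
    return total / count


if __name__ == "__main__":
    observed = [[0.0, 1.0], [0.5, 1.5]]
    future = [[0.5, 1.5], [0.5, 2.5], [1.5, 1.5]]
    prediction = zero_velocity_baseline(observed, horizon=len(future))
    print(prediction, mean_abs_error(prediction, future))
```

Because the prediction starts exactly at the last observed pose, such a baseline cannot produce a jump at the first predicted frame, which may help explain why it is hard to beat over short horizons.
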
What problem does this paper attempt to address?
This paper addresses several key problems in human motion prediction:

1. **Improve short-term motion prediction**: Existing deep recurrent neural network (RNN) methods suffer from noticeable discontinuities in short-term prediction, especially at the first predicted frame. This limits their usefulness in practical applications such as visual tracking. The paper introduces a residual architecture and a sampling-based loss function, which make predictions smoother and reduce error.
2. **Reduce the complexity of hyper-parameter tuning**: Existing methods usually require careful hyper-parameter tuning, in particular of the noise schedule. This tuning is difficult to carry out and can affect the final performance of the model. The proposed method does not require this additional tuning, which simplifies training.
3. **Simplify the model structure**: Existing methods usually use multi-layer LSTMs or SRNNs. Although these models perform well on certain tasks, they are computationally expensive and difficult to train. The paper proposes a single-layer GRU without a spatial encoding layer, which greatly simplifies the model while maintaining or even improving prediction performance.
4. **Explore the training of multi-action models**: Existing methods usually train a separate model for each action, whereas the paper trains a single model that handles multiple actions. Such a model can better exploit regularities in large-scale datasets and improve overall prediction performance.
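
The residual idea from point 1 can be sketched as a decoder whose recurrent step outputs an offset that is added to the previous pose, so the first predicted frame starts from the last observed one. This is a minimal illustration under stated assumptions: `residual_decode`, `StepFn`, and `toy_step` are hypothetical names, and `toy_step` is only a stand-in for a GRU cell followed by a linear projection.

```python
from typing import Callable, List, Tuple

Vector = List[float]
# Hypothetical recurrent step: (input pose, hidden state) -> (pose offset, new hidden state).
StepFn = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def residual_decode(step: StepFn, last_pose: Vector, hidden: Vector,
                    horizon: int) -> List[Vector]:
    """Roll out `horizon` frames; each frame is the previous pose plus a predicted offset."""
    poses: List[Vector] = []
    pose = list(last_pose)
    for _ in range(horizon):
        delta, hidden = step(pose, hidden)
        # Residual connection: the step models a change in pose, not the pose itself.
        pose = [p + d for p, d in zip(pose, delta)]
        poses.append(pose)
    return poses


def toy_step(pose: Vector, hidden: Vector) -> Tuple[Vector, Vector]:
    """Stand-in for a trained recurrent cell: a small offset derived from the hidden state."""
    delta = [0.5 * h for h in hidden]
    return delta, hidden


if __name__ == "__main__":
    for frame in residual_decode(toy_step, [1.0, 2.0], [0.2, -0.2], horizon=3):
        print(frame)
```

If the step outputs an all-zero offset, the rollout simply repeats the last observed pose, so a network of this form only has to learn small corrections around a continuous trajectory.
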
### Main contributions of the paper

- **A new sequence-to-sequence (seq2seq) architecture**: The architecture is trained with a sampling-based loss function, which lets the model learn to recover from its own mistakes during prediction and reduces prediction discontinuity.
- **A residual architecture**: Adding residual connections between the input and output of each RNN unit helps the model represent the continuity of motion, especially at the first predicted frame.
- **A simplified model structure**: Using a single-layer GRU instead of multi-layer LSTMs or SRNNs reduces the computational cost and improves training efficiency.
- **Multi-action training**: By training a single model that handles multiple actions, the paper shows the potential of this approach for improving prediction performance.

### Experimental results

The paper evaluates the proposed method in a series of experiments:

1. **Sequence-to-sequence architecture and sampling-based loss**: The seq2seq architecture trained with the sampling-based loss performs comparably to existing methods in short-term motion prediction and generates more plausible motion in long-term prediction.
2. **Residual architecture**: With the residual architecture, the short-term prediction error is significantly reduced and the predictions are smoother.
3. **Multi-action model**: Training a single model on multiple actions improves prediction performance and shows the benefit of using larger datasets.

Overall, the proposed architecture and training changes address several deficiencies of existing human motion prediction methods and suggest directions for further research in this field.
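
To illustrate what a sampling-based loss changes, the sketch below compares two ways of scoring the same one-step model: feeding it the ground-truth previous frame at every step, versus feeding it its own previous prediction. The squared-error measure, the `Predictor` type, and the toy model are assumptions chosen for illustration; the summary above does not specify the exact loss used in the paper.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical one-step model: current pose -> predicted next pose.
Predictor = Callable[[Vector], Vector]


def squared_error(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def teacher_forced_loss(model: Predictor, seed: Vector, targets: List[Vector]) -> float:
    """Every step receives the ground-truth previous frame as input."""
    inputs = [seed] + targets[:-1]
    return sum(squared_error(model(x), y) for x, y in zip(inputs, targets)) / len(targets)


def sampling_based_loss(model: Predictor, seed: Vector, targets: List[Vector]) -> float:
    """Every step receives the model's own previous prediction as input."""
    loss, pose = 0.0, seed
    for y in targets:
        pose = model(pose)  # the prediction becomes the next input
        loss += squared_error(pose, y)
    return loss / len(targets)


if __name__ == "__main__":
    # Toy model that underestimates the true step of 1.0 per frame.
    model = lambda pose: [x + 0.5 for x in pose]
    seed, targets = [0.0], [[1.0], [2.0], [3.0]]
    print(teacher_forced_loss(model, seed, targets))
    print(sampling_based_loss(model, seed, targets))
```

In this toy case the second loss is larger because errors accumulate along the rollout. Training against that quantity penalizes drift that feeding ground-truth inputs would hide, and it matches how the model is actually used at prediction time.
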