Abstract: 3D human motion prediction, the task of predicting future human poses on the basis of historically observed motion sequences, is a core task in computer vision. Thus far, it has been successfully applied to both autonomous driving and human–robot interaction. Previous work has usually employed Recurrent Neural Network (RNN)-based models to predict future human poses. However, as previous works have amply demonstrated, RNN-based prediction models produce unrealistic and discontinuous motion because prediction errors accumulate over time. To address this, we propose a feed-forward, 3D skeleton-based model for human motion prediction. This model, the Spatial–Temporal Graph Convolutional Network (ST-GCN), automatically learns the spatial and temporal patterns of human motion from input sequences and thereby overcomes the limitations of previous approaches. Specifically, our ST-GCN model is based on an encoder-decoder architecture. The encoder consists of 5 ST-GCN modules, each comprising a spatial graph convolutional network (GCN) layer and a 2D convolution-based temporal convolutional network (TCN) layer, which together encode the spatio-temporal dynamics of human motion. The decoder, consisting of 5 TCN layers, then uses the encoded spatio-temporal representation to predict future human poses. We conducted extensive experiments with the ST-GCN model on several large-scale 3D human pose datasets (Human3.6M, AMASS, 3DPW), adopting MPJPE (Mean Per Joint Position Error) as the evaluation metric. The experimental results demonstrate that our ST-GCN model outperforms the baseline models in both short-term (< 400 ms) and long-term (> 400 ms) predictions.
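As an illustration of the evaluation metric named above, MPJPE is commonly computed as the Euclidean distance between each predicted joint and its ground-truth position, averaged over all joints and frames. The following is a minimal sketch under that common definition, not the authors' evaluation code; the function name `mpjpe` and the nested-list pose layout (frames, then joints, then x, y, z coordinates) are assumptions made for this example.

```python
import math


def mpjpe(predicted, target):
    """Mean Per Joint Position Error for a sequence of 3D poses.

    Assumed layout (hypothetical, for illustration only): each argument is a
    list of frames, each frame is a list of joints, and each joint is an
    (x, y, z) tuple. The result is in the same unit as the inputs.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")

    total_error = 0.0
    joint_count = 0
    for pred_frame, true_frame in zip(predicted, target):
        if len(pred_frame) != len(true_frame):
            raise ValueError("each frame must have the same number of joints")
        for pred_joint, true_joint in zip(pred_frame, true_frame):
            # Euclidean distance between predicted and ground-truth joint.
            total_error += math.dist(pred_joint, true_joint)
            joint_count += 1

    if joint_count == 0:
        raise ValueError("at least one joint is required")
    return total_error / joint_count


# One frame with two joints: the first is off by 5 units, the second is exact.
example_pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
example_true = [[(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]]
example_error = mpjpe(example_pred, example_true)  # (5 + 0) / 2 = 2.5
```

A lower value means the predicted skeleton lies closer to the ground truth. In a short-term versus long-term comparison, the same function would be applied separately to the frames at each prediction horizon.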

Skip-attention Encoder–decoder Framework for Human Motion Prediction

Human Motion Prediction Based on Attention Mechanism

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

Stacked residual blocks based encoder-decoder framework for human motion prediction

Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation.

Multi-level Motion Attention for Human Motion Prediction

Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning

Multiscale Spatial and Temporal Learning for Human Motion Prediction

DMS-GCN: Dynamic Mutiscale Spatiotemporal Graph Convolutional Networks for Human Motion Prediction

Spatiotemporal Consistency Learning from Momentum Cues for Human Motion Prediction

Using Appearance to Predict Pedestrian Trajectories Through Disparity-Guided Attention and Convolutional LSTM

3D Skeleton-Based Human Motion Prediction Using Spatial–temporal Graph Convolutional Network

Learning Snippet-to-Motion Progression for Skeleton-based Human Motion Prediction

Geometric algebra-based multiscale encoder-decoder networks for 3D motion prediction

Learning Progressive Joint Propagation for Human Motion Prediction

Long-Term Human Motion Prediction with Scene Context

Adversarial Geometry-Aware Human Motion Prediction

Spatio-Temporal Encoding and Decoding-Based Method for Future Human Activity Skeleton Synthesis

PVRED: A Position-Velocity Recurrent Encoder-Decoder for Human Motion Prediction

Predicting Human Motion Using Key Subsequences

Past Movements-Guided Motion Representation Learning for Human Motion Prediction