Temporal Consistency Two-Stream CNN for Human Motion Prediction

Jin Tang, Jin Zhang, Jianqin Yin

DOI: https://doi.org/10.48550/arXiv.2104.05015

2021-04-11

Abstract: Fusion is critical for a two-stream network. In this paper, we propose a novel temporal fusion (TF) module to fuse the two-stream joints' information to predict human motion, including a temporal concatenation and a reinforcement trajectory spatial-temporal (TST) block, specifically designed to keep prediction temporal consistency. In particular, the temporal concatenation keeps the temporal consistency of preliminary predictions from two streams. Meanwhile, the TST block improves the spatial-temporal feature coupling. As a result, the TF module can increase the temporal continuity between the first predicted pose and the given poses and between successive predicted poses. The fusion is based on a two-stream network that consists of a dynamic velocity stream (V-Stream) and a static position stream (P-Stream) because we found that the joints' velocity information improves the short-term prediction, while the joints' position information is better at long-term prediction, and they are complementary in motion prediction. Finally, our approach achieves impressive results on three benchmark datasets, including H3.6M, CMU-Mocap, and 3DPW, in both short-term and long-term predictions, confirming its effectiveness and efficiency.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses temporal consistency in human motion prediction. Existing two-stream networks often overlook temporal continuity and the coupling of spatio-temporal features when fusing joint position and velocity information. The resulting predictions are temporally incoherent: visible discontinuities appear between the given poses and the first predicted pose, and between successive predicted poses. In addition, joint velocity and joint position behave differently across horizons, so a model that relies on only one of them does not perform best in both short-term and long-term prediction.

To overcome these problems, the paper proposes a Temporal Fusion (TF) module that combines a temporal concatenation with a Reinforcement Trajectory Spatial-Temporal (TST) block. The temporal concatenation keeps the preliminary predictions from the two streams temporally consistent, and the TST block strengthens the coupling of spatio-temporal features. The fusion sits on top of a two-stream framework in which a Dynamic Velocity Stream (V-Stream) models joint velocity and a Static Position Stream (P-Stream) models joint position. Velocity information mainly helps short-term prediction, while position information is better suited to long-term prediction, so the two streams are complementary. On three benchmark datasets (H3.6M, CMU-Mocap, and 3DPW), the paper reports strong results in both short-term and long-term prediction, which the authors take as evidence of the method's effectiveness and efficiency.
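To make the position/velocity distinction concrete, here is a minimal Python sketch. It is a toy illustration, not the authors' network: it assumes velocity is the first-order difference between consecutive frames, treats concatenation as joining given and predicted frames along the time axis, and uses a constant-velocity guess as a stand-in for a stream's output. All function names are hypothetical.

```python
# Toy illustration only; not the paper's implementation.
# A pose sequence is a list of frames; each frame is a flat list of joint coordinates.

def joint_velocities(poses):
    """Assumed velocity definition: v[t] = p[t+1] - p[t] for each coordinate."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(poses, poses[1:])
    ]

def integrate_velocities(last_pose, velocities):
    """Turn predicted velocities back into poses, starting from the last given pose."""
    poses, current = [], list(last_pose)
    for v in velocities:
        current = [c + dv for c, dv in zip(current, v)]
        poses.append(current)
    return poses

def temporal_concat(given, predicted):
    """Join given and predicted frames along the time axis."""
    return given + predicted

def max_frame_jump(poses):
    """Crude continuity measure: largest coordinate change between consecutive frames."""
    return max(
        abs(b - a)
        for prev, nxt in zip(poses, poses[1:])
        for a, b in zip(prev, nxt)
    )

# Three given frames of a single 2D joint.
given = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]]
velocities = joint_velocities(given)

# Stand-in for a velocity-stream output: repeat the last observed velocity twice.
predicted_velocities = [velocities[-1], velocities[-1]]
predicted = integrate_velocities(given[-1], predicted_velocities)

sequence = temporal_concat(given, predicted)
jump = max_frame_jump(sequence)
```

Measuring the frame-to-frame jump over the concatenated sequence, rather than over the predicted frames alone, also covers the boundary between the last given pose and the first predicted pose, which is where the summary above says discontinuities appear.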
