Temporal Consistency Two-Stream CNN for Human Motion Prediction

Jin Tang, Jin Zhang, Jianqin Yin
DOI: https://doi.org/10.48550/arXiv.2104.05015
2021-04-11
Abstract: Fusion is critical for a two-stream network. In this paper, we propose a novel temporal fusion (TF) module to fuse the two-stream joints' information to predict human motion, including a temporal concatenation and a reinforcement trajectory spatial-temporal (TST) block, specifically designed to keep prediction temporal consistency. In particular, the temporal concatenation keeps the temporal consistency of preliminary predictions from two streams. Meanwhile, the TST block improves the spatial-temporal feature coupling. However, the TF module can increase the temporal continuities between the first predicted pose and the given poses and between each predicted pose. The fusion is based on a two-stream network that consists of a dynamic velocity stream (V-Stream) and a static position stream (P-Stream) because we found that the joints' velocity information improves the short-term prediction, while the joints' position information is better at long-term prediction, and they are complementary in motion prediction. Finally, our approach achieves impressive results on three benchmark datasets, including H3.6M, CMU-Mocap, and 3DPW in both short-term and long-term predictions, confirming its effectiveness and efficiency.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses temporal consistency in human motion prediction. Existing two-stream networks often overlook temporal continuity and the coupling of spatio-temporal features when fusing joint position and velocity information. The resulting predictions are temporally incoherent, with visible discontinuities between the last given pose and the first predicted pose, and between successive predicted poses. In addition, the two kinds of information behave differently across horizons: velocity information improves short-term prediction, while position information is better at long-term prediction, so neither alone performs best at both. To overcome these problems, the paper proposes a Temporal Fusion (TF) module that combines a temporal concatenation with a reinforcement trajectory spatial-temporal (TST) block. The temporal concatenation keeps the preliminary predictions from the two streams temporally consistent, and the TST block improves spatio-temporal feature coupling. The fusion is built on a two-stream network in which a dynamic velocity stream (V-Stream) and a static position stream (P-Stream) model the joints' velocity and position information, respectively. Because the two streams are complementary, combining them improves prediction at both horizons. Experiments on three benchmark datasets (H3.6M, CMU-Mocap, and 3DPW) show strong results in both short-term and long-term prediction, which the authors take as confirmation of the method's effectiveness and efficiency.
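
To make the position/velocity distinction concrete, the following is a minimal sketch and not the authors' implementation. It assumes velocity is the frame-to-frame difference of joint coordinates, which is a common convention but is not stated in the text above. The helper names (`joint_velocities`, `integrate_velocities`, `boundary_jump`) are hypothetical, and a constant-velocity extrapolation stands in for a learned model. The last helper measures the gap between the last given pose and the first predicted pose, which is the kind of discontinuity the TF module is designed to reduce.

```python
from math import sqrt


def joint_velocities(poses):
    """Frame-to-frame differences: velocity[t] = poses[t + 1] - poses[t]."""
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(poses, poses[1:])
    ]


def integrate_velocities(last_pose, velocities):
    """Recover positions by accumulating velocities from the last observed pose."""
    poses, current = [], list(last_pose)
    for v in velocities:
        current = [c + d for c, d in zip(current, v)]
        poses.append(current)
    return poses


def boundary_jump(observed, predicted):
    """Euclidean distance between the last observed and first predicted pose."""
    return sqrt(sum((p - o) ** 2 for o, p in zip(observed[-1], predicted[0])))


# Toy sequence: 4 frames, one joint with (x, y, z) coordinates.
observed = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
velocities = joint_velocities(observed)  # three velocity frames

# Hypothetical constant-velocity "prediction" standing in for a learned model.
predicted = integrate_velocities(observed[-1], [velocities[-1]] * 2)
```

In this toy case the boundary jump equals the per-frame step size, so the predicted sequence continues the observed motion without an abrupt break. A prediction whose first pose sat far from the last given pose would produce a much larger value.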