Abstract: Human motion prediction consists in forecasting future body poses from historically observed sequences. It is a longstanding challenge due to the complex dynamics and uncertainty of motion. Existing methods focus on building complicated neural networks to model motion dynamics. In current training pipelines, the predicted results are required to be strictly similar to the training samples under an L2 loss. However, little attention has been paid to the uncertainty property, which is crucial to the prediction task. We argue that the recorded motion in training data could be one observation of a possible future, rather than a predetermined result. In addition, existing works weight the prediction error on each future frame equally during training, while recent work indicates that different frames could play different roles. In this work, a novel, computationally efficient encoder-decoder model with uncertainty consideration is proposed, which could learn proper characteristics for future frames by a dynamic function. Experimental results on benchmark datasets demonstrate that our uncertainty-aware approach has clear advantages both quantitatively and qualitatively. Moreover, the proposed method could produce motion sequences of much better quality that avoid the intractable shaking artefacts. We believe our work could provide a novel perspective on uncertainty for the general motion prediction task and encourage further studies in this field. The code will be available in
What problem does this paper attempt to address?
This paper addresses how to account for uncertainty in human motion prediction. Existing methods typically optimize an L2 loss that forces predictions to match the training samples exactly. This ignores the inherent uncertainty of motion: different actions may start from similar poses yet develop very differently. The paper therefore proposes a new encoder-decoder model that learns appropriate characteristics for future frames through a dynamic function, and it emphasizes the role of uncertainty in the prediction task.
### Core problems of the paper
1. **Limitations of existing methods**:
- Existing methods mainly focus on constructing complex neural networks to model motion dynamics, but ignore the uncertainty of motion.
- The prediction error is weighted equally for every future frame during training, although different frames differ in importance.
- The recorded motion in the training data is treated as a predetermined result, rather than as one observation of a possible future.
2. **Importance of uncertainty**:
- Future motion is highly uncertain, especially in non-periodic behaviors.
- Different frames have different uncertainties. Short-term prediction is relatively easy, while long-term prediction is more difficult and has greater diversity.
3. **Solutions in the paper**:
- A new encoder-decoder model is proposed, which combines the Self-Attention Graph Generation Block (SAGGB) and the Temporal Convolutional Network (TCN) module to extract spatial and temporal information.
- The Adaptive-Salient Loss is introduced. This loss function can dynamically adjust the weights of different frames, so as to better handle uncertainty.
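The idea of a data-driven joint graph can be illustrated with plain scaled dot-product self-attention over the joints of a single pose. This is a minimal sketch, not the paper's SAGGB: the function names, the projection matrices `w_q` and `w_k`, and the use of raw 3D joint coordinates as features are all assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_graph(joints, w_q, w_k):
    """Return a (joints x joints) attention matrix for one pose.

    joints: list of per-joint feature vectors (hypothetical input format).
    w_q, w_k: query/key projection matrices (hypothetical, normally learned).
    Row i holds the weights with which joint i attends to every joint.
    """
    q = matmul(joints, w_q)
    k = matmul(joints, w_k)
    k_t = [list(col) for col in zip(*k)]   # transpose of the key matrix
    scale = math.sqrt(len(q[0]))           # standard 1/sqrt(d) scaling
    scores = matmul(q, k_t)
    return [softmax([s / scale for s in row]) for row in scores]

# Toy pose with three joints; identity projections stand in for learned weights.
pose = [[0.0, 0.0, 0.0], [0.1, 0.5, 0.0], [0.2, 1.0, 0.1]]
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
graph = attention_graph(pose, eye, eye)
for row in graph:
    print([round(w, 3) for w in row])
```

Each row of `graph` sums to 1, so it can be read as a soft adjacency matrix that is recomputed for every pose instead of being fixed by the skeleton.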
### Specific methods
1. **Self-Attention Graph Generation Block (SAGGB)**:
- Generate data-driven graphs through the self-attention mechanism to model the complexity of different actions and behaviors.
- Generate an attention graph for each pose, reflecting the dependencies between joints.
2. **Adaptive Loss**:
- Based on a probabilistic model, treat the prediction of each frame as an independent task.
- Dynamically adjust the weights of different frames to reflect the characteristic that uncertainty increases over time.
3. **Salient Loss**:
- Emphasize the importance of the first frame as the initial state of the prediction sequence.
- Highlight the importance of the initial pose through a fixed value \(\omega\).
4. **Final loss function**:
- The final loss function is a weighted combination of the adaptive loss and the salient loss:
\[
L=\lambda L_{\text{Adaptive}}+(1 - \lambda) L_{\text{Salient}}
\]
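The combination above can be sketched in a few lines. The summary does not give the exact form of the adaptive term, so the sketch assumes a common uncertainty-weighting form in which each frame `t` has a learnable log-variance `s_t` and contributes `exp(-s_t) * e_t + s_t`, where `e_t` is the mean per-joint L2 error of that frame. The function names, the averaging over frames, and the toy values of `omega` and `lam` are likewise assumptions, not the paper's settings.

```python
import math

def frame_errors(pred, target):
    """Mean per-joint Euclidean distance for each future frame.

    pred, target: nested lists shaped [frames][joints][3].
    """
    errors = []
    for p_frame, t_frame in zip(pred, target):
        dists = [math.dist(p, t) for p, t in zip(p_frame, t_frame)]
        errors.append(sum(dists) / len(dists))
    return errors

def adaptive_loss(errors, log_vars):
    """Assumed adaptive term: exp(-s_t) * e_t + s_t, averaged over frames.

    A larger log-variance s_t down-weights frame t; the + s_t term
    penalizes claiming high uncertainty everywhere.
    """
    terms = [math.exp(-s) * e + s for e, s in zip(errors, log_vars)]
    return sum(terms) / len(terms)

def salient_loss(errors, omega):
    """Salient term: the first predicted frame scaled by a fixed omega."""
    return omega * errors[0]

def total_loss(errors, log_vars, omega, lam):
    """L = lam * L_Adaptive + (1 - lam) * L_Salient."""
    return lam * adaptive_loss(errors, log_vars) + (1 - lam) * salient_loss(errors, omega)

# Toy example: two future frames, one joint each.
pred = [[[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]
target = [[[3.0, 4.0, 0.0]], [[0.0, 0.0, 1.0]]]
errors = frame_errors(pred, target)
loss = total_loss(errors, log_vars=[0.0, 0.0], omega=2.0, lam=0.5)
print(errors, loss)
```

In a real training loop the `log_vars` would be trainable parameters updated by the optimizer together with the network weights; here they are fixed numbers so the arithmetic stays easy to follow.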
### Experimental results
- **Short- and medium-term prediction**: On the H3.6M, CMU Mocap and 3DPW datasets, the proposed method outperforms the baseline methods in most short- and medium-term prediction tasks.
- **Long-term prediction**: Despite its small number of parameters, the method is also competitive in long-term prediction tasks, especially in reducing noise and uncertainty.
- **Computational complexity**: The model has low computational complexity and a short inference time.
### Conclusion
By taking uncertainty into account, this paper proposes a new encoder-decoder model and an adaptive-salient loss function, which improve the accuracy and robustness of human motion prediction. The experimental results show that the method achieves significant performance improvements on multiple benchmark datasets.