Abstract: Existing Graph Convolutional Networks for human motion prediction largely adopt a one-step scheme, which outputs the prediction straight from the history input, failing to exploit human motion patterns. We observe that human motions have transitional patterns and can be split into snippets representative of each transition. Each snippet can be reconstructed from its starting and ending poses, referred to as the transitional poses. We propose a snippet-to-motion multi-stage framework that breaks motion prediction into sub-tasks that are easier to accomplish. Each sub-task integrates three modules: transitional pose prediction, snippet reconstruction, and snippet-to-motion prediction. Specifically, we propose to first predict only the transitional poses. Then we use them to reconstruct the corresponding snippets, obtaining a close approximation to the true motion sequence. Finally, we refine them to produce the final prediction output. To implement the network, we propose a novel unified graph modeling, which allows for direct and effective feature propagation compared to existing approaches, which rely on separate space-time modeling. Extensive experiments on the Human3.6M, CMU Mocap and 3DPW datasets verify the effectiveness of our method, which achieves state-of-the-art performance.
What problem does this paper attempt to address?
The paper targets a limitation of existing human motion prediction methods based on Graph Convolutional Networks (GCNs): most adopt a one-step scheme that outputs the prediction directly from the historical input, so they fail to fully exploit the patterns in human motion. Although GCNs capture spatial relationships within a single frame well, these methods lack an explicit mechanism for modeling how a motion evolves over time. They usually aggregate information across frames with a separate temporal component, for example Temporal Convolutional Networks (TCNs) applied along the time axis, rather than modeling the temporal context explicitly. This limits their ability to capture sequential patterns, subtle motion transitions, and fine-grained temporal dependencies in human motion.
To solve this problem, the authors propose a snippet-based method that combines a snippet representation of motion with a snippet-to-motion prediction framework. The core observations are that human motion exhibits transitional, multi-stage patterns, and that predicting a few key poses is easier than predicting the entire sequence. The paper therefore proposes a multi-stage framework that decomposes motion prediction into more tractable sub-tasks, each of which contains three modules: transitional pose prediction, snippet reconstruction, and snippet-to-motion prediction. The specific steps are as follows:
1. **Transitional Pose Prediction**: First, predict only the future transitional poses, that is, the starting and ending poses of each snippet.
2. **Snippet Reconstruction**: Then, use these transitional poses to reconstruct the corresponding motion snippets, through techniques such as linear interpolation, to obtain a close approximation of the true motion sequence.
3. **Snippet-to-Motion Prediction**: Finally, refine the reconstructed snippets to produce the final predicted motion sequence.
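The reconstruction step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes plain linear interpolation between consecutive transitional poses, a fixed number of frames per snippet (the motion fragment between two transitional poses), and poses stored as flat lists of coordinates. The names `reconstruct_snippet`, `reconstruct_motion`, and `frames_per_snippet` are hypothetical, and the learned prediction and refinement stages are not shown.

```python
from typing import List

Pose = List[float]  # flattened joint coordinates for one frame (assumed layout)


def reconstruct_snippet(start: Pose, end: Pose, num_frames: int) -> List[Pose]:
    """Linearly interpolate num_frames poses from start to end, both included."""
    if num_frames < 2:
        raise ValueError("a snippet needs at least its two transitional poses")
    snippet = []
    for i in range(num_frames):
        w = i / (num_frames - 1)  # 0.0 at the start pose, 1.0 at the end pose
        snippet.append([(1.0 - w) * s + w * e for s, e in zip(start, end)])
    return snippet


def reconstruct_motion(transitional_poses: List[Pose], frames_per_snippet: int) -> List[Pose]:
    """Chain snippets between consecutive transitional poses into one coarse sequence."""
    motion = [list(transitional_poses[0])]
    for start, end in zip(transitional_poses, transitional_poses[1:]):
        # Skip the first frame of each snippet: it duplicates the previous end pose.
        motion.extend(reconstruct_snippet(start, end, frames_per_snippet)[1:])
    return motion


# Toy example: a single joint with (x, y) coordinates and three transitional poses.
coarse = reconstruct_motion([[0.0, 0.0], [1.0, 2.0], [1.0, 0.0]], frames_per_snippet=3)
```

In a full pipeline, `coarse` would only be the approximation that the final stage refines; the quality of the result still depends on how well the transitional poses were predicted.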
To implement this framework, the authors propose a new unified graph modeling method that allows direct and effective feature propagation, in contrast to existing methods, which rely on separate spatial and temporal modeling. Experimental results show that the method achieves state-of-the-art performance on multiple benchmark datasets (Human3.6M, CMU Mocap, and 3DPW).
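One way to picture the difference between separate and unified modeling is the toy sketch below. It is an illustration under stated assumptions, not the paper's architecture: it treats every (frame, joint) pair as a node of a single graph, uses fixed rather than learned weights, and all names (`temporal_adj`, `spatial_adj`, `unified_adj`, `direct_adj`) are hypothetical. It shows that mixing along time and then along joints is equivalent to a unified adjacency with Kronecker structure, while a general unified adjacency can also connect a joint in one frame directly to a different joint in another frame.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product; row index is (row of a) * len(b) + (row of b)."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matvec(a: Matrix, v: List[float]) -> List[float]:
    return [sum(x * y for x, y in zip(row, v)) for row in a]


# Toy sizes: T = 2 frames, J = 2 joints, one scalar feature per joint.
features = [[1.0, 2.0], [3.0, 4.0]]      # features[t][j]
temporal_adj = [[1.0, 0.0], [1.0, 1.0]]  # frame-to-frame weights (T x T)
spatial_adj = [[1.0, 1.0], [0.0, 1.0]]   # joint-to-joint weights (J x J)

# Separate modeling: mix along time, then along joints.
separable = matmul(matmul(temporal_adj, features), transpose(spatial_adj))

# Unified modeling: one adjacency over all T * J (frame, joint) nodes.
flat = [v for row in features for v in row]  # node index = t * J + j
unified_adj = kron(temporal_adj, spatial_adj)
unified_out = matvec(unified_adj, flat)

# A unified adjacency can also hold an edge the separable form above lacks,
# here (frame 1, joint 1) feeding (frame 0, joint 0) directly.
direct_adj = [row[:] for row in unified_adj]
direct_adj[0][3] = 1.0
direct_out = matvec(direct_adj, flat)
```

Here `unified_out` matches `separable` entry for entry, while `direct_out` differs at node 0 because of the added cross-frame, cross-joint edge. The cost of this flexibility is a larger adjacency: (T × J)² entries instead of T² + J².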
In summary, this paper aims to improve the accuracy and robustness of human motion prediction, especially long-term prediction, by introducing a snippet-based motion representation and a multi-stage prediction framework.