CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion

Jiarui Sun, Girish Chowdhary
2024-08-20
Abstract: Stochastic Human Motion Prediction (HMP) aims to predict multiple possible future human pose sequences from observed ones. Most prior works learn motion distributions through encoding-decoding in the latent space, which does not preserve motion's spatial-temporal structure. While effective, these methods often require complex, multi-stage training and yield predictions that are inconsistent with the provided history and can be physically unrealistic. To address these issues, we propose CoMusion, a single-stage, end-to-end diffusion-based stochastic HMP framework. CoMusion is inspired from the insight that a smooth future pose initialization improves prediction performance, a strategy not previously utilized in stochastic models but evidenced in deterministic works. To generate such initialization, CoMusion's motion predictor starts with a Transformer-based network for initial reconstruction of corrupted motion. Then, a graph convolutional network (GCN) is employed to refine the prediction considering past observations in the discrete cosine transformation (DCT) space. Our method, facilitated by the Transformer-GCN module design and a proposed variance scheduler, excels in predicting accurate, realistic, and consistent motions, while maintaining appropriate diversity. Experimental results on benchmark datasets demonstrate that CoMusion surpasses prior methods across metrics, while demonstrating superior generation quality. Our Code is released at https://github.com/jsun57/CoMusion/.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion" addresses three key problems in stochastic Human Motion Prediction (HMP):

1. **Complexity of multi-stage training**: Most existing high-performance methods rely on complex multi-stage training to improve prediction performance. They usually need several training rounds to cover different motion patterns and to verify motion validity, which makes model tuning cumbersome and the methods unattractive in many application scenarios.
2. **Consistency and realism of prediction results**: Existing stochastic HMP methods often generate motions that are inconsistent with the provided history, or even unrealistic. To regularize the prediction and enhance diversity, these methods usually introduce explicit diversity-promoting losses or construct additional sampling spaces, which often leads to sub-optimal predictions that are sometimes physically implausible.
3. **Model design gap**: Deterministic HMP methods achieve good results by combining Graph Convolutional Networks (GCN) with the Discrete Cosine Transform (DCT) to model spatio-temporal relationships. Most stochastic HMP methods, however, learn motion distributions through encoding-decoding in a latent space, which fails to preserve the spatio-temporal structure of motion and hurts prediction consistency and realism.

### Solutions

To solve these problems, the authors propose CoMusion, a single-stage, end-to-end diffusion framework for consistent stochastic HMP. Specifically:

1. **Smooth future pose initialization**: CoMusion generates a smooth future pose initialization by using a Transformer network to produce an initial reconstruction of the noisy motion. This strategy has proven effective in deterministic models but had not previously been used in stochastic models.
2. **GCN-DCT design**: The initially reconstructed motion sequence is combined with the historical observations in the DCT space and refined by a GCN to capture spatio-temporal dependencies. This design simplifies learning and improves the accuracy, realism, and consistency of the predictions.
3. **Direct motion prediction**: CoMusion predicts the motion directly instead of following the common noise-prediction scheme. This allows the model to integrate structure-aware losses and further simplifies learning.
4. **Improved variance scheduler**: By adjusting the standard cosine variance scheduler, CoMusion improves the accuracy and diversity of the generated motion samples.

### Experimental results

CoMusion outperforms existing methods on multiple benchmark datasets, especially in prediction accuracy and in the realism of generated samples:

- **Prediction accuracy**: On the Human3.6M dataset, CoMusion outperforms the other methods on the ADE and FDE metrics.
- **Behavior consistency**: CoMusion improves the CMD and FID metrics by 35% and 51%, respectively, indicating more consistent and realistic generated behaviors.
- **Diversity**: Although CoMusion does not achieve the highest diversity score (APD), it performs best on the APDE metric, indicating that it models future randomness correctly.

### Conclusion

CoMusion achieves efficient, high-performance stochastic HMP by combining a Transformer and a GCN to capture the spatio-temporal dynamics of human motion in the DCT space. The method is accurate and generates consistent, realistic motion samples, which gives it broad application prospects.
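
To make the role of the DCT space in the summary above more concrete, the following is a minimal NumPy sketch of an orthonormal DCT-II applied along the time axis of a motion sequence. It only illustrates why working with a few low-frequency DCT coefficients yields smooth trajectories; it is not the authors' implementation and omits the Transformer and GCN entirely. The function names, the sequence length, and the feature size (16 joints with 3 coordinates) are assumptions chosen for illustration.

```python
import numpy as np


def dct_basis(num_frames: int) -> np.ndarray:
    """Orthonormal DCT-II basis; row k is the k-th cosine basis vector over time."""
    frame = np.arange(num_frames)
    freq = frame.reshape(-1, 1)
    basis = np.sqrt(2.0 / num_frames) * np.cos(
        np.pi * (2 * frame + 1) * freq / (2 * num_frames)
    )
    basis[0] /= np.sqrt(2.0)  # rescale the constant row so the basis is orthonormal
    return basis


def low_pass(motion: np.ndarray, keep: int) -> np.ndarray:
    """Smooth a (frames, features) motion by keeping the first `keep` DCT coefficients."""
    basis = dct_basis(motion.shape[0])
    coeffs = basis @ motion      # forward DCT along the time axis
    coeffs[keep:] = 0.0          # discard high-frequency components
    return basis.T @ coeffs      # inverse DCT (transpose, since the basis is orthonormal)


def roughness(motion: np.ndarray) -> float:
    """Sum of squared frame-to-frame differences; lower means smoother."""
    return float(np.sum(np.diff(motion, axis=0) ** 2))


rng = np.random.default_rng(0)
frames, features = 50, 48  # hypothetical: 50 frames, 16 joints x 3 coordinates
time = np.linspace(0.0, 1.0, frames).reshape(-1, 1)
clean = np.sin(2 * np.pi * time) * np.ones((1, features))
noisy = clean + 0.1 * rng.standard_normal((frames, features))
smooth = low_pass(noisy, keep=10)
print(roughness(noisy), roughness(smooth))
```

Keeping all coefficients reproduces the input exactly, while truncating them can only reduce the frame-to-frame roughness, because the DCT-II basis vectors are the eigenvectors of the path-graph Laplacian that defines this roughness measure.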
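
The summary above mentions two diffusion design choices: a modified cosine variance scheduler and direct motion prediction instead of noise prediction. The sketch below shows only the standard cosine schedule commonly used in diffusion models, the usual forward corruption step, and a plain mean-squared-error loss on the clean motion. CoMusion's adjusted scheduler and its structure-aware losses are not reproduced here; the function names, the offset `s = 0.008`, the clipping value, and the tensor shape are assumptions for illustration.

```python
import numpy as np


def cosine_betas(num_steps: int, s: float = 0.008, max_beta: float = 0.999) -> np.ndarray:
    """Per-step noise variances beta_t of the standard (unmodified) cosine schedule."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]                          # cumulative signal rate, 1 at t = 0
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)          # clip to avoid a degenerate last step


def q_sample(x0, step, alpha_bar, rng):
    """Forward corruption: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[step]) * x0 + np.sqrt(1.0 - alpha_bar[step]) * noise


def direct_prediction_loss(predicted_x0, x0) -> float:
    """With direct motion prediction the target is the clean motion x0, not the noise."""
    return float(np.mean((predicted_x0 - x0) ** 2))


rng = np.random.default_rng(0)
betas = cosine_betas(num_steps=10)
alpha_bar = np.cumprod(1.0 - betas)           # consistent with the clipped betas
future = rng.standard_normal((100, 16, 3))    # hypothetical: 100 frames, 16 joints, xyz
noisy = q_sample(future, step=9, alpha_bar=alpha_bar, rng=rng)
print(alpha_bar.round(3))
```

Because the network output lives in motion space under this parameterization, losses defined on poses can be applied to it directly, which is the property the summary attributes to direct motion prediction.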