Abstract: Stochastic Human Motion Prediction (HMP) aims to predict multiple possible future human pose sequences from observed ones. Most prior works learn motion distributions through encoding-decoding in the latent space, which does not preserve motion's spatial-temporal structure. While effective, these methods often require complex, multi-stage training and yield predictions that are inconsistent with the provided history and can be physically unrealistic. To address these issues, we propose CoMusion, a single-stage, end-to-end diffusion-based stochastic HMP framework. CoMusion is inspired by the insight that a smooth future pose initialization improves prediction performance, a strategy not previously utilized in stochastic models but evidenced in deterministic works. To generate such initialization, CoMusion's motion predictor starts with a Transformer-based network for initial reconstruction of corrupted motion. Then, a graph convolutional network (GCN) is employed to refine the prediction considering past observations in the discrete cosine transform (DCT) space. Our method, facilitated by the Transformer-GCN module design and a proposed variance scheduler, excels in predicting accurate, realistic, and consistent motions, while maintaining appropriate diversity. Experimental results on benchmark datasets demonstrate that CoMusion surpasses prior methods across metrics, while demonstrating superior generation quality. Our code is released at https://github.com/jsun57/CoMusion/.
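The abstract's two premises, that a smooth trajectory is a useful starting point and that refinement happens in DCT space, can be illustrated with a small NumPy sketch. It is not CoMusion's implementation: the helper names `dct_matrix`, `lowpass_motion`, and `roughness`, the toy trajectory, and the choice of 8 retained coefficients are all assumptions made for illustration. The sketch only shows that projecting a motion sequence onto the lowest temporal DCT frequencies yields a smoother sequence.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis; row k is the k-th cosine basis over n frames."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # frame index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)    # rescale the DC row so that basis @ basis.T == I
    return basis

def lowpass_motion(motion: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` lowest temporal frequencies of a (frames, features) motion."""
    d = dct_matrix(motion.shape[0])
    coeffs = d @ motion         # temporal DCT of every feature channel
    coeffs[keep:] = 0.0         # discard high-frequency coefficients
    return d.T @ coeffs         # inverse transform (d is orthonormal)

def roughness(motion: np.ndarray) -> float:
    """Mean squared frame-to-frame difference; lower means smoother."""
    return float(np.mean(np.diff(motion, axis=0) ** 2))

# Toy data (hypothetical): 50 frames, 6 feature channels, sine motion plus jitter.
rng = np.random.default_rng(0)
frames = np.linspace(0.0, 2.0 * np.pi, 50)[:, None]
noisy = np.sin(frames + np.arange(6)[None, :]) + 0.1 * rng.normal(size=(50, 6))
smooth = lowpass_motion(noisy, keep=8)
print(roughness(noisy), roughness(smooth))
```

Because the basis is orthonormal, the inverse transform is just the transpose, and keeping all coefficients reproduces the input exactly. In practice, `scipy.fft.dct` with `norm="ortho"` computes the same transform.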

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion" addresses several key problems in Stochastic Human Motion Prediction (HMP):

1. **Complexity of multi-stage training**
   - Most existing high-performance methods rely on complex multi-stage training to improve prediction performance. They usually need several training rounds to cover different motion patterns and to verify motion validity, which makes model tuning cumbersome and limits their appeal in many application scenarios.
2. **Consistency and realism of prediction results**
   - Existing stochastic HMP methods often generate motions that are inconsistent with the provided history or that are unrealistic. To regularize prediction and enhance diversity, they usually introduce explicit diversity-promoting losses or construct additional sampling spaces, which often leads to sub-optimal predictions that are sometimes physically implausible.
3. **Model design gap**
   - Deterministic HMP methods have achieved good results by combining Graph Convolutional Networks (GCN) with the Discrete Cosine Transform (DCT) to model spatio-temporal relationships. Most stochastic HMP methods, however, learn motion distributions by encoding-decoding in the latent space, which fails to preserve the spatio-temporal structure of motion and hurts prediction consistency and realism.

### Solutions

To solve the above problems, the authors propose CoMusion, a single-stage, end-to-end diffusion framework for consistent stochastic HMP. Specifically:

1. **Smooth future pose initialization**
   - CoMusion generates a smooth future pose initialization by using a Transformer network to produce an initial reconstruction of the noisy motion. This strategy has proven effective in deterministic models but has not been fully exploited in stochastic models.
2. **GCN-DCT design**
   - The initially reconstructed motion sequence is concatenated with the observed history in the DCT space and refined by a Graph Convolutional Network (GCN) to capture spatio-temporal dependencies. This design simplifies the learning process and improves the accuracy, realism, and consistency of prediction.
3. **Direct motion prediction**
   - CoMusion predicts the motion directly instead of using the common noise-prediction scheme. This allows the model to integrate structure-aware losses and further simplifies the learning process.
4. **Improved variance scheduler**
   - By adjusting the standard cosine variance scheduler, CoMusion improves the accuracy and diversity of generated motion samples.

### Experimental results

The experimental results show that CoMusion significantly outperforms existing methods on multiple benchmark datasets, especially in prediction accuracy and the realism of generated samples:

- **Prediction accuracy**: On the Human3.6M dataset, CoMusion is far ahead of other methods on the ADE and FDE metrics.
- **Behavior consistency**: CoMusion improves the CMD and FID metrics by 35% and 51%, respectively, showing a significant gain in the consistency and realism of generated behaviors.
- **Diversity**: Although CoMusion does not score highest on the diversity metric (APD), it performs best on the APDE metric, indicating that it models future randomness correctly.

### Conclusion

CoMusion achieves efficient, high-performance stochastic HMP by combining Transformer and GCN designs to capture the spatio-temporal dynamics of human motion in the DCT space. The method not only predicts accurately but also generates consistent and realistic motion samples, and it has broad application prospects.
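The accuracy and diversity metrics named above (ADE, FDE, APD) can be sketched as follows. This is a minimal sketch that assumes the best-of-K convention commonly used in the stochastic HMP literature, with each pose flattened into a single vector. The function names and array layout are assumptions, and the paper's own evaluation code may differ in detail. CMD, FID, and APDE are not reproduced here.

```python
import numpy as np

def ade_fde(samples: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Best-of-K displacement errors.

    samples: (K, T, D) predicted futures, D = flattened pose dimension (assumed layout).
    gt:      (T, D) ground-truth future.
    """
    dist = np.linalg.norm(samples - gt[None], axis=-1)  # (K, T) per-frame pose error
    ade = dist.mean(axis=1).min()   # average over time, then best sample
    fde = dist[:, -1].min()         # final frame only, then best sample
    return float(ade), float(fde)

def apd(samples: np.ndarray) -> float:
    """Average pairwise L2 distance between the K samples (diversity)."""
    k = samples.shape[0]
    flat = samples.reshape(k, -1)                        # one vector per sample
    pair = np.linalg.norm(flat[:, None] - flat[None], axis=-1)  # (K, K) distances
    upper = np.triu_indices(k, k=1)                      # each unordered pair once
    return float(pair[upper].mean())

# Toy usage with random data (hypothetical shapes: 10 samples, 100 frames, 48 dims).
rng = np.random.default_rng(0)
preds = rng.normal(size=(10, 100, 48))
target = rng.normal(size=(100, 48))
print(ade_fde(preds, target), apd(preds))
```

Taking the minimum over samples rewards a model if any one of its K predictions is close to the ground truth, which is why a separate diversity measure such as APD is reported alongside ADE and FDE.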

CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion

Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction

DMS-GCN: Dynamic Mutiscale Spatiotemporal Graph Convolutional Networks for Human Motion Prediction

Towards Accurate 3D Human Motion Prediction from Incomplete Observations

Spatiotemporal Consistency Learning from Momentum Cues for Human Motion Prediction

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

A Mixture of Experts Approach to 3D Human Motion Prediction

MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction

A Stochastic Conditioning Scheme for Diverse Human Motion Prediction

Class-guided Human Motion Prediction Via Multi-Spatial-temporal Supervision

TransFusion: A Practical and Effective Transformer-based Diffusion Model for 3D Human Motion Prediction

AMHGCN: Adaptive multi-level hypergraph convolution network for human motion prediction

DivDiff: A Conditional Diffusion Model for Diverse Human Motion Prediction

Efficient Human Motion Prediction Using Temporal Convolutional Generative Adversarial Network

MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion

Human Motion Prediction Using Manifold-Aware Wasserstein GAN

Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

Spatio-Temporal Multi-Subgraph GCN for 3D Human Motion Prediction

Human Motion Prediction Based on Space-Time-Separable Graph Convolutional Network

Towards Realistic 3D Human Motion Prediction with A Spatio-temporal Cross-transformer Approach