Abstract: This project addresses the challenge of human motion prediction, a critical area for applications such as autonomous vehicle movement detection. Previous works have emphasized the need for low inference times to provide real-time performance for applications like these. Our primary objective is to critically evaluate existing model architectures, identifying their advantages and opportunities for improvement, by replicating the state-of-the-art (SOTA) Spatio-Temporal Transformer model as closely as possible given computational constraints. These models have surpassed the limitations of RNN-based models and have demonstrated the ability to generate plausible motion sequences over both short- and long-term horizons through the use of spatio-temporal representations. We also propose a novel architecture to address the challenge of real-time inference speed by incorporating a Mixture of Experts (MoE) block within the Spatial-Temporal (ST) attention layer. The particular variation used is Soft MoE, a fully-differentiable sparse Transformer that has shown promising ability to enable larger model capacity at lower inference cost. We make our code publicly available at

What problem does this paper attempt to address?

This paper addresses the challenges of **human motion prediction**, in particular how to improve real-time inference speed while keeping precision high. The problem matters for applications such as autonomous vehicles and interactive robots, which need to understand and predict pedestrians' actions quickly and accurately to improve safety and the interactive experience.

### Main objectives of the paper:

1. **Evaluate existing model architectures**: The paper first critically evaluates existing model architectures and identifies their advantages and opportunities for improvement. Specifically, the authors attempt to replicate the state-of-the-art Spatio-Temporal Transformer model proposed by Aksan et al. as closely as possible, despite limited computing resources.
2. **Propose a new architecture**: To address the challenge of real-time inference speed, the paper proposes a new architecture that introduces a "Mixture of Experts (MoE)" block in the attention layer of the Spatio-Temporal Transformer. This method targets inference speed by dynamically selecting the most relevant model components, reducing inference cost while maintaining high precision.

### Key technical points:

- **Spatio-Temporal Transformer**: This model overcomes the limitations of RNN-based models in generating plausible motion sequences by jointly representing spatial and temporal information.
- **Mixture of Experts (MoE)**: The paper adopts Soft MoE, a fully differentiable sparse Transformer that can reduce inference cost while increasing model capacity. In general, MoE reduces computational overhead by using a gating mechanism to decide which expert networks are activated.

### Experiments and results:

- **Baseline model performance analysis**: The authors first train multiple baseline models, including seq2seq, RNN, etc., with default parameters and compare their performance. The results show that the seq2seq model outperforms the other models on multiple metrics.
- **Hyperparameter tuning**: After hyperparameter tuning of the Spatio-Temporal Transformer, the authors find that the tuned model performs best in terms of training loss and validation loss, although its MAE is similar to that of the seq2seq model.
- **MoE vs. Vanilla ST Transformer**: The experimental results show that the MoE model can significantly improve inference efficiency without sacrificing performance. Specifically, as the number of parameters increases, the inference time of the MoE model grows much more slowly than that of the traditional model.

### Conclusion:

The paper implements the current state-of-the-art Spatio-Temporal Transformer model and, building on it, proposes a new architecture that incorporates MoE. The experimental results show that the new architecture significantly improves inference speed while maintaining high precision, which is important for application scenarios that require real-time processing.
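
The Soft MoE block named under the key technical points can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `soft_moe`, the two-layer ReLU experts, and all dimensions are assumptions chosen for illustration, and the sketch does not show how the block is wired into the ST attention layer. It follows the commonly described Soft MoE formulation, in which each slot is a softmax-weighted mix of all tokens (dispatch) and each token output is a softmax-weighted mix of all slot outputs (combine), in contrast to a classic MoE gate that hard-selects a few experts per token.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(tokens, phi, expert_w1, expert_w2):
    """Hypothetical Soft MoE layer.

    tokens:    (m, d)     input tokens
    phi:       (d, n*p)   slot parameters, p slots for each of n experts
    expert_w1: (n, d, h)  first-layer weights, one matrix per expert (assumed MLP)
    expert_w2: (n, h, d)  second-layer weights, one matrix per expert
    """
    n = expert_w1.shape[0]
    logits = tokens @ phi                    # (m, n*p)
    dispatch = softmax(logits, axis=0)       # per slot: weights over tokens sum to 1
    combine = softmax(logits, axis=1)        # per token: weights over slots sum to 1
    slots = dispatch.T @ tokens              # (n*p, d) soft mixtures of tokens
    p = slots.shape[0] // n
    slots = slots.reshape(n, p, -1)          # group slots by expert: (n, p, d)
    hidden = np.maximum(np.einsum("npd,ndh->nph", slots, expert_w1), 0.0)
    outs = np.einsum("nph,nhd->npd", hidden, expert_w2).reshape(n * p, -1)
    return combine @ outs                    # (m, d)

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
m, d, n, p, h = 6, 8, 4, 1, 16
tokens = rng.normal(size=(m, d))
phi = rng.normal(size=(d, n * p))
w1 = rng.normal(size=(n, d, h)) * 0.1
w2 = rng.normal(size=(n, h, d)) * 0.1
out = soft_moe(tokens, phi, w1, w2)
print(out.shape)  # (6, 8)
```

In this sketch, adding experts increases the parameter count, while each expert only processes its own `p` slots. That is one way to read the summary's claim that inference time grows more slowly than parameter count, though the measured behaviour depends on the actual implementation.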

A Mixture of Experts Approach to 3D Human Motion Prediction

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

Towards Realistic 3D Human Motion Prediction with A Spatio-temporal Cross-transformer Approach

Toward Realistic 3D Human Motion Prediction with a Spatio-Temporal Cross-Transformer Approach

TransFusion: A Practical and Effective Transformer-based Diffusion Model for 3D Human Motion Prediction

AdvMT: Adversarial Motion Transformer for Long-term Human Motion Prediction

Spatiotemporal Consistency Learning from Momentum Cues for Human Motion Prediction

Towards Efficient 3D Human Motion Prediction Using Deformable Transformer-based Adversarial Network

Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction

Long-Term Human Motion Prediction by Modeling Motion Context and Enhancing Motion Dynamic

GGTr: An Innovative Framework for Accurate and Realistic Human Motion Prediction

Spatial–temporal modeling for prediction of stylized human motion

CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion

STTG-net: a Spatio-temporal Network for Human Motion Prediction Based on Transformer and Graph Convolution Network

Learning Progressive Joint Propagation for Human Motion Prediction

3D Skeleton-Based Human Motion Prediction Using Spatial–temporal Graph Convolutional Network

Robust Human Motion Forecasting using Transformer-based Model

Multi-level Motion Attention for Human Motion Prediction

DMS-GCN: Dynamic Mutiscale Spatiotemporal Graph Convolutional Networks for Human Motion Prediction

Multiscale Spatial and Temporal Learning for Human Motion Prediction

Towards Accurate 3D Human Motion Prediction from Incomplete Observations