A Mixture of Experts Approach to 3D Human Motion Prediction

Edmund Shieh, Joshua Lee Franco, Kang Min Bae, Tej Lalvani
2024-05-10
Abstract: This project addresses the challenge of human motion prediction, a critical area for applications such as autonomous vehicle movement detection. Previous works have emphasized the need for low inference times to provide real-time performance for applications like these. Our primary objective is to critically evaluate existing model architectures, identifying their advantages and opportunities for improvement by replicating the state-of-the-art (SOTA) Spatio-Temporal Transformer model as closely as possible given computational constraints. These models have surpassed the limitations of RNN-based models and have demonstrated the ability to generate plausible motion sequences over both short- and long-term horizons through the use of spatio-temporal representations. We also propose a novel architecture to address challenges of real-time inference speed by incorporating a Mixture of Experts (MoE) block within the Spatial-Temporal (ST) attention layer. The particular variation that is used is Soft MoE, a fully-differentiable sparse Transformer that has shown promising ability to enable larger model capacity at lower inference cost. We make our code publicly available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenge of **human motion prediction**, and in particular how to improve real-time inference speed while preserving accuracy. The problem matters for applications such as autonomous vehicles and interactive robots, which must quickly and accurately understand and predict the actions of pedestrians to improve safety and the quality of interaction.

### Main objectives of the paper:

1. **Evaluate existing model architectures**: The paper first critically evaluates existing model architectures and identifies their advantages and opportunities for improvement. Specifically, the authors attempt to replicate the state-of-the-art Spatio-Temporal Transformer model proposed by Aksan et al. as closely as possible, despite limited computing resources.
2. **Propose a new architecture**: To address the challenge of real-time inference speed, the paper proposes a new architecture that introduces a "Mixture of Experts (MoE)" block into the attention layer of the Spatio-Temporal Transformer. The aim is to reduce inference cost while maintaining accuracy by routing the computation through specialized expert components.

### Key technical points:

- **Spatio-Temporal Transformer**: This model overcomes the limitations of RNN-based models in generating plausible motion sequences by jointly representing spatial and temporal information.
- **Mixture of Experts (MoE)**: The paper adopts Soft MoE, a fully differentiable sparse Transformer that can increase model capacity while reducing inference cost. A gating mechanism decides how inputs are routed to the expert networks, which reduces the computational overhead.

### Experiments and results:

- **Baseline model performance analysis**: The authors first train multiple baseline models, including seq2seq, RNN, etc., with default parameters and compare their performance. The results show that the seq2seq model outperforms the other models on multiple metrics.
- **Hyperparameter tuning**: After hyperparameter tuning of the Spatio-Temporal Transformer, the authors find that the tuned model achieves the best training and validation loss, although its MAE is similar to that of the seq2seq model.
- **MoE vs. Vanilla ST Transformer**: The experimental results show that the MoE model can significantly improve inference efficiency without sacrificing performance. Specifically, as the number of parameters increases, the inference time of the MoE model grows much more slowly than that of the vanilla model.

### Conclusion:

The paper implements the current state-of-the-art Spatio-Temporal Transformer model and, building on it, proposes a new architecture that incorporates MoE. The experimental results show that the new architecture significantly improves inference speed while maintaining accuracy, which is significant for application scenarios that require real-time processing.
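
To make the Soft MoE idea more concrete, the following is a minimal, dependency-free sketch of a single Soft MoE layer. It is not the authors' implementation: it follows the general Soft MoE formulation (tokens are softly mixed into a fixed number of slots, each expert processes its own slots, and the slot outputs are softly mixed back per token), and every name here (`soft_moe`, `phi`, `slots_per_expert`, the two toy experts) is a hypothetical choice made for illustration.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def soft_moe(tokens, phi, experts, slots_per_expert):
    """Apply one Soft MoE layer.

    tokens: n x d input matrix.
    phi:    d x (len(experts) * slots_per_expert) slot parameters (assumed learned).
    experts: list of callables mapping a length-d vector to a length-d vector.
    """
    logits = matmul(tokens, phi)                          # n x s
    # Dispatch weights: for each slot, a softmax over all tokens.
    dispatch_t = [softmax(col) for col in transpose(logits)]  # s x n
    # Combine weights: for each token, a softmax over all slots.
    combine = [softmax(row) for row in logits]            # n x s
    # Each slot input is a convex combination of the tokens.
    slot_inputs = matmul(dispatch_t, tokens)              # s x d
    # Each expert only processes its own fixed group of slots.
    slot_outputs = []
    for i, expert in enumerate(experts):
        start = i * slots_per_expert
        for slot in slot_inputs[start:start + slots_per_expert]:
            slot_outputs.append(expert(slot))
    # Each output token is a convex combination of the slot outputs.
    return matmul(combine, slot_outputs)                  # n x d

# Toy usage with made-up sizes and two hypothetical experts.
random.seed(0)
n_tokens, dim, num_experts, slots_per_expert = 4, 3, 2, 1
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
phi = [[random.gauss(0, 1) for _ in range(num_experts * slots_per_expert)]
       for _ in range(dim)]
experts = [
    lambda v: [2.0 * x for x in v],      # expert 0: simple scaling
    lambda v: [max(0.0, x) for x in v],  # expert 1: ReLU
]
output = soft_moe(tokens, phi, experts, slots_per_expert)
print(len(output), len(output[0]))  # 4 3
```

In this sketch each expert runs once per slot rather than once per token, so the expert cost depends on the number of slots, not on the sequence length or the number of experts' total parameters. That is one plausible reading of why inference time could grow slowly as capacity is added, though the summary above does not spell out the mechanism in the authors' model.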