MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty

Leo Bringer, Joey Wilson, Kira Barton, Maani Ghaffari
2024-10-05
Abstract: This paper introduces a Multi-modal Diffusion model for Motion Prediction (MDMP) that integrates and synchronizes skeletal data and textual descriptions of actions to generate refined long-term motion predictions with quantifiable uncertainty. Existing methods for motion forecasting or motion generation rely solely on either prior motions or text prompts, facing limitations with precision or control, particularly over extended durations. The multi-modal nature of our approach enhances the contextual understanding of human motion, while our graph-based transformer framework effectively captures both spatial and temporal motion dynamics. As a result, our model consistently outperforms existing generative techniques in accurately predicting long-term motions. Additionally, by leveraging diffusion models' ability to capture different modes of prediction, we estimate uncertainty, significantly improving spatial awareness in human-robot interactions by incorporating zones of presence with varying confidence levels for each body joint.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to generate long-term human motion predictions that are both accurate and accompanied by quantifiable uncertainty, in the context of human-robot collaboration (HRC). Existing motion prediction and motion generation methods mainly rely on a single data source, either past motion data or text prompts, which limits their precision or controllability over long horizons. These limitations are particularly pronounced in dynamic collaborative scenarios that require precise interaction, collision avoidance, and efficient trajectory planning. To address them, the paper proposes a Multi-modal Diffusion model for Motion Prediction (MDMP), which combines and synchronizes skeletal data with textual descriptions of actions to generate more refined long-term motion predictions and to estimate their uncertainty. The multi-modal input improves the contextual understanding of human motion, while the graph-based Transformer framework captures both the spatial and the temporal dynamics of motion; according to the authors, the model consistently outperforms existing generative techniques in accurately predicting long-term motion. In addition, by leveraging the ability of diffusion models to capture different modes of prediction, MDMP estimates uncertainty, which improves spatial awareness in human-robot interaction by defining zones of presence with varying confidence levels around each body joint.
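The idea of a per-joint zone of presence can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that several motion predictions have already been sampled from some generative model for a single time step, and it summarizes their spread as a sphere around each joint. The function name `presence_zones`, the scale factor `k`, and the use of the root-mean-square distance as the spread measure are all hypothetical choices made for illustration.

```python
import math


def presence_zones(samples, k=2.0):
    """Summarize sampled predictions as one sphere per joint.

    samples[s][j] is the (x, y, z) position of joint j in sampled
    prediction s, all for the same time step (assumed layout).
    Returns a list of (center, radius) pairs, one per joint.
    """
    num_samples = len(samples)
    num_joints = len(samples[0])
    zones = []
    for j in range(num_joints):
        points = [samples[s][j] for s in range(num_samples)]
        # Mean position of the joint across all sampled predictions.
        center = tuple(sum(p[a] for p in points) / num_samples for a in range(3))
        # Root-mean-square distance of the samples from that mean.
        rms = math.sqrt(
            sum(math.dist(p, center) ** 2 for p in points) / num_samples
        )
        # Assumption: the zone radius is k times the RMS spread.
        zones.append((center, k * rms))
    return zones


# Two sampled predictions of a two-joint skeleton (toy data).
toy_samples = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(2.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
zones = presence_zones(toy_samples, k=2.0)
```

In this toy input the first joint differs between the two samples, so it receives a sphere of non-zero radius, while the second joint is identical in both samples and receives a radius of zero. A larger `k` gives a more conservative zone, which is the kind of trade-off a robot planner might tune for collision avoidance.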