MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting
Aayush Prakash, Daeil Kim, F. De la Torre, Chen Wu, Jose Ribeiro-Gomes, Alexandre Bernardino, Tianhui Cai, Shingo Takagi, Z. Á. Milacski, Amaury Aubel
DOI: https://doi.org/10.1109/WACV57701.2024.00499
2024-01-03
Abstract: Human motion synthesis has numerous applications, including animation, gaming, robotics, and sports science. In recent years, human motion generation from natural language has emerged as a promising alternative to costly and labor-intensive data collection methods that rely on motion capture or wearable sensors (e.g., suits). Even so, generating human motion from textual descriptions remains a challenging and intricate task, primarily because of the scarcity of large-scale supervised datasets capable of capturing the full diversity of human activity. This study proposes a new approach, called MotionGPT, that addresses the limitations of previous text-based human motion generation methods by utilizing the extensive semantic information available in large language models (LLMs). We first pretrain a doubly text-conditional motion diffusion model on both coarse ("high-level") and detailed ("low-level") ground truth text data. Then, during inference, we improve motion diversity and alignment with the training set by zero-shot prompting GPT-3 for additional "low-level" details. Our method achieves new state-of-the-art quantitative results in terms of Fréchet Inception Distance (FID) and motion diversity metrics, and improves all considered metrics. Furthermore, it shows strong qualitative performance, producing natural results. Code is available at https://github.com/humansensinglab/MotionGPT
Computer Science,Engineering