Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
2024-01-26
Abstract: In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected from limited laboratory scenes with textual descriptions manually labeled, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline, which can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is of high precision, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes. Besides, Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the shortcomings of existing motion datasets in scale, diversity, expressiveness, and scene coverage. Existing datasets mainly contain body-only poses and lack facial expressions, hand gestures, and fine-grained pose descriptions. Most of them are also collected in limited laboratory scenes with manually annotated text descriptions, which greatly limits their scalability. To overcome these limitations, the authors develop a whole-body motion and text annotation pipeline that automatically annotates motion from single-view or multi-view videos, providing comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. Based on this pipeline, they construct Motion-X, a large-scale 3D whole-body motion dataset intended to make generated motion more expressive, diverse, and natural, and to improve the accuracy of 3D whole-body human mesh recovery.

### Main Problem Summary:

1. **Limitations of Body-Only Pose Data**: Existing datasets mainly contain body poses and lack facial expressions and hand gestures.
2. **Insufficient Data Volume and Diversity**: Existing datasets are limited in volume and diversity and mainly cover indoor scenes.
3. **Lack of Long-Sequence Motions**: Existing datasets lack diverse long-sequence motions.
4. **Non-Scalability of Manual Annotation**: The text labels of existing datasets are manually annotated, which is labor-intensive and often lacks professional quality.

### Solutions:

- **Develop an Automatic Annotation Pipeline**: A high-precision, cost-effective, and scalable whole-body motion and text annotation pipeline is designed.
- **Construct a Large-Scale Dataset**: Based on this pipeline, the authors construct Motion-X, a large-scale dataset containing 15.6M accurate 3D whole-body pose annotations across 81.1K motion sequences.
- **Multi-Scene Data Collection**: A large number of videos are collected from the Internet and from existing datasets, covering games, animations, professional performances, and diverse outdoor actions.

Through these methods, the paper aims to address the deficiencies of existing datasets and to provide high-quality data support for future research on motion generation and 3D human mesh recovery.
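
As a rough illustration of the two annotation levels described above (one semantic label per sequence, one pose description per frame), the following Python sketch models a single annotated sequence and derives the average sequence length from the reported totals. The class names, field names, and helper function are hypothetical and do not reflect the dataset's actual file format or API; only the figures 15.6M and 81.1K come from the text, and both are rounded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record (not the dataset's real schema)."""
    # Flattened SMPL-X parameters for this frame; the length is left open
    # because the summary does not specify a parameter layout.
    smplx_params: List[float]
    # Frame-level whole-body pose description (free text).
    pose_description: str


@dataclass
class MotionSequence:
    """Hypothetical per-sequence record (not the dataset's real schema)."""
    # Sequence-level semantic label (free text).
    semantic_label: str
    frames: List[FrameAnnotation] = field(default_factory=list)


def average_frames_per_sequence(total_frames: float, total_sequences: float) -> float:
    """Return the mean number of annotated frames per motion sequence."""
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    return total_frames / total_sequences


# Totals as reported in the abstract: 15.6M frames, 81.1K sequences.
avg_frames = average_frames_per_sequence(15.6e6, 81.1e3)
print(f"about {avg_frames:.0f} frames per sequence on average")
```

With the rounded totals, this works out to roughly 192 annotated frames per sequence on average; individual sequences can of course be much shorter or longer.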