Abstract: In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected from limited laboratory scenes with textual descriptions manually labeled, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline, which can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is of high precision, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes. Besides, Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.

What problem does this paper attempt to address?

The paper addresses the shortcomings of existing motion datasets in scale, diversity, expressiveness, and scene coverage. Specifically, existing motion datasets mainly contain body-only poses and lack facial expressions, hand gestures, and fine-grained pose descriptions. Most of them are also collected in limited laboratory scenes with manually annotated text descriptions, which greatly limits their scalability. To overcome these limitations, the authors develop a whole-body motion and text annotation pipeline that automatically annotates motion from single-view or multi-view videos, providing comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. Based on this pipeline, they construct Motion-X, a large-scale 3D whole-body motion dataset intended to support more expressive, diverse, and natural motion generation and to improve the accuracy of 3D whole-body human mesh recovery.

### Main Problem Summary:

1. **Limitations of Body-Only Pose Data**: Existing datasets mainly contain body poses and lack facial expressions and hand gestures.
2. **Insufficient Data Volume and Diversity**: The data volume and diversity of existing datasets are insufficient, and they mainly cover indoor scenes.
3. **Lack of Long-Sequence Motions**: Existing datasets lack diverse long-sequence motions.
4. **Non-Scalability of Manual Annotation**: The text labels of existing datasets are manually annotated, a process that is both non-professional and labor-intensive.

### Solutions:

- **Develop an Automatic Annotation Pipeline**: A high-precision, cost-effective, and scalable whole-body motion and text annotation pipeline has been designed.
- **Construct a Large-Scale Dataset**: Based on this pipeline, a large-scale dataset, Motion-X, has been constructed. It contains 15.6M precise 3D whole-body pose annotations covering 81.1K motion sequences.
- **Multi-Scene Data Collection**: A large number of videos have been collected from the Internet and existing datasets, covering games, animations, professional performances, and diverse outdoor actions.

Through these methods, the paper aims to address the deficiencies of existing datasets and provide high-quality data support for future research on motion generation and 3D human mesh recovery.
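To make the two annotation levels and the headline figures concrete, the following is a minimal Python sketch. The `MotionSequence` class and its field names are hypothetical and do not reflect the dataset's actual file format; only the 15.6M and 81.1K figures come from the abstract. The sketch pairs one sequence-level semantic label with one pose and one pose description per frame, and derives the average sequence length implied by the reported totals.

```python
from dataclasses import dataclass, field


@dataclass
class MotionSequence:
    """Hypothetical record layout, NOT the dataset's real file format."""
    semantic_label: str  # one sequence-level semantic label
    frame_poses: list = field(default_factory=list)         # one whole-body pose per frame (opaque here)
    frame_descriptions: list = field(default_factory=list)  # one pose description per frame

    def is_consistent(self) -> bool:
        # Every frame should carry both a pose and a description.
        return len(self.frame_poses) == len(self.frame_descriptions)


def average_frames_per_sequence(num_frames: float, num_sequences: float) -> float:
    """Average number of annotated frames per motion sequence."""
    return num_frames / num_sequences


# Figures reported in the abstract: 15.6M pose annotations, 81.1K sequences.
avg = average_frames_per_sequence(15.6e6, 81.1e3)
print(f"{avg:.1f} frames per sequence on average")  # prints "192.4 frames per sequence on average"
```

The reported totals therefore imply roughly 192 annotated frames per sequence on average. The matching counts at each level (15.6M poses and 15.6M frame-level descriptions, 81.1K sequences and 81.1K sequence-level labels) are what the `is_consistent` check models for a single record.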

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

MOtion Human Parsing - A New Benchmark for 3D Human Parsing.

HuMoMM: A Multi-Modal Dataset and Benchmark for Human Motion Analysis

MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data

HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling

Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space

SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

New multi-view human motion capture framework

HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

Inter-X: Towards Versatile Human-Human Interaction Analysis

The MI-Motion Dataset and Benchmark for 3D Multi-Person Motion Prediction

MotionScript: Natural Language Descriptions for Expressive 3D Human Motions

HardMo: A Large-scale Hardcase Dataset for Motion Capture

MMHead: Towards Fine-grained Multi-modal 3D Facial Animation

Expressive Forecasting of 3D Whole-body Human Motions

Contact-aware Human Motion Generation from Textual Descriptions

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Kinematic Dataset of Actors Expressing Emotions