Abstract: The generation of humanoid animation from text prompts can profoundly impact animation production and AR/VR experiences. However, existing methods only generate body motion data, excluding facial expressions and hand movements. This limitation, primarily due to a lack of a comprehensive whole-body motion dataset, inhibits their readiness for production use. Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts in the artificially augmented data or lower quality in the data extracted from RGB videos. In this work, we propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data. T2M-X trains three separate Vector Quantized Variational AutoEncoders (VQ-VAEs) for body, hand, and face on respective high-quality data sources to ensure high-quality motion outputs, and a Multi-indexing Generative Pretrained Transformer (GPT) model with motion consistency loss for motion generation and coordination among different body parts. Our results show significant improvements over the baselines both quantitatively and qualitatively, demonstrating its robustness against the dataset limitations.

What problem does this paper attempt to address?

This paper addresses a gap in text-to-motion generation for humanoid animation: existing methods generate only body motion and leave out facial expressions and hand movements. The main cause is the lack of a comprehensive whole-body motion dataset. Recent attempts to build one either introduce motion inconsistency between body parts (in artificially augmented data) or yield lower-quality data (when extracted from RGB videos).

To work around this, the paper proposes T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data. T2M-X trains three separate vector-quantized variational autoencoders (VQ-VAEs) for the body, hands, and face, each on its own high-quality data source, so that every part is generated at high quality. A multi-indexing generative pretrained Transformer (GPT) model with a motion consistency loss then generates the motion and coordinates the body parts. The reported results show significant quantitative and qualitative improvements over the baselines, which the authors present as evidence of robustness to the dataset limitations.

### Main contributions of the paper:

1. **A unified high-quality motion dataset**: The data format is standardized, and motion and text augmentation are applied to produce detailed text descriptions.
2. **A two-stage motion generation pipeline**: Three VQ-VAE expert models (trained on partially annotated data) are followed by a multi-indexing GPT model that generates motion sequences from text descriptions.
3. **A motion consistency loss across body parts in a joint space**: The loss enforces consistency among all body parts throughout training.

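
To make the idea of turning motion into index sequences concrete, here is a minimal sketch of vector quantization by nearest-codebook lookup, the core operation of a VQ-VAE. It is not the authors' implementation: the function names `quantize` and `dequantize`, the toy 2-D codebook, and the sample latent vectors are assumptions made for illustration, and a real codebook is learned during training and is much larger.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        # Euclidean distance from this latent to every codebook vector.
        dists = [math.dist(z, code) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each index (the decoder's input)."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy data: 4 codes of dimension 2, and 3 encoder outputs.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

indices = quantize(latents, codebook)
reconstructed = dequantize(indices, codebook)
print(indices)
print(reconstructed)
```

In a setup like the one summarized here, each body part would have its own codebook, and a sequence model would predict index sequences of this kind rather than raw motion values.
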
### Method overview:

- **VQ-VAE expert models**: Each model encodes the motion of one body part (body, hands, or face) into a sequence of codebook indices and decodes such indices back into motion.
- **Multi-indexing GPT model**: Predicts whole-body index sequences from the text; the VQ-VAE decoders then convert them into motion data.
- **Consistency learning**: Feature extractors map the modalities into a joint space, where a consistency loss enforces agreement among them.
- **Motion jitter mitigation**: Low-pass filtering and a new pose representation reduce jitter in low-quality datasets.

### Experimental setup:

- **Dataset**: The entire dataset is randomly shuffled and split into training, validation, and test sets in proportions of 80%, 10%, and 10%.
- **Implementation details**: The AdamW optimizer is used with a batch size of 256 and a learning rate of 1e-4. The VQ-VAE expert models use a codebook size of 512×512 and a downsampling rate of 4. The GPT model has 9 Transformer layers, a hidden dimension of 512, and 16 heads.

### Evaluation metrics:

- **Fréchet Inception Distance (FID)**: The distance between the distributions of real and generated motions, computed on extracted motion features.
- **Diversity**: The average Euclidean distance over 300 randomly selected pairs of motion features from a set.
- **Multimodality**: The average Euclidean distance over 10 pairs of motion features generated from the same text description.
- **Multimodal Distance**: The average Euclidean distance between the features of generated motions and the features of their corresponding text descriptions.

### Experimental results:

- T2M-X outperforms the baseline methods on multiple evaluation metrics, especially FID and multimodality, indicating that the motions it generates are more natural and diverse.

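
The pair-based metrics above reduce to averaging Euclidean distances over sampled feature pairs. The following is a rough sketch of that computation, not the paper's evaluation code. The function names, the fixed random seed, and the toy feature vectors are assumptions. In practice the features would come from a pretrained motion feature extractor, and multimodality is commonly averaged over many text prompts rather than computed for one.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def sampled_pair_distance(features: List[Vector], num_pairs: int, seed: int = 0) -> float:
    """Average Euclidean distance over randomly sampled pairs of features."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        # Draw two distinct positions from the feature list.
        a, b = rng.sample(features, 2)
        total += math.dist(a, b)
    return total / num_pairs


def diversity(features: List[Vector], num_pairs: int = 300, seed: int = 0) -> float:
    """Spread of motion features across the whole generated set."""
    return sampled_pair_distance(features, num_pairs, seed)


def multimodality(features_same_text: List[Vector], num_pairs: int = 10, seed: int = 0) -> float:
    """Spread of motion features generated from one text description."""
    return sampled_pair_distance(features_same_text, num_pairs, seed)


# Hypothetical toy features: two points that are exactly 5 units apart.
toy_features = [(0.0, 0.0), (3.0, 4.0)]
print(diversity(toy_features))

# Identical generations for one prompt give zero multimodality.
print(multimodality([(1.0, 2.0)] * 4))
```

Higher values of both quantities indicate more varied outputs; they say nothing about realism on their own, which is why they are reported alongside FID.
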
With these methods and experiments, the paper addresses the body-only limitation of existing text-to-motion generation methods and offers a new approach to generating high-quality whole-body motion.

T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

HumanTOMATO: Text-aligned Whole-body Motion Generation

Contact-aware Human Motion Generation from Textual Descriptions

Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

TAAT: Think and Act from Arbitrary Texts in Text2Motion

T2M-HiFiGPT: Generating High Quality Human Motion from Textual Descriptions with Residual Discrete Representations

Fg-T2M: Fine-Grained Text-Driven Human Motion Generation Via Diffusion Model

Motion Generation from Fine-grained Textual Descriptions

Motion Control for Enhanced Complex Action Video Generation

Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

Learning Generalizable Human Motion Generator with Reinforcement Learning

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer

T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

MMHead: Towards Fine-grained Multi-modal 3D Facial Animation