Abstract: Recent generative methods such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Denoising Diffusion Probabilistic Models (DDPMs) have transformed human motion synthesis and attracted significant attention in the field. However, unconditionally generating highly diverse human motions from a given distribution remains challenging. To increase diversity, previous methods usually train a deep neural network (DNN) as a transport map that transforms a Gaussian noise distribution into the real human motion distribution. According to Figalli's regularity theory, however, the optimal transport map is frequently discontinuous, whereas DNNs can represent only continuous maps. Consequently, the generated human motions tend to concentrate on densely populated regions of the data distribution, resulting in mode collapse or mode mixture. To address these issues, we propose MOOT, an efficient method for unconditional human motion synthesis. First, we use a GRU- and transformer-based reconstruction network to map human motions to a latent space. Next, we use convex optimization to compute an Optimal Transport (OT) map that matches the noise distribution to the latent distribution of human motions. Finally, we combine the extended OT map with the generator of the reconstruction network to generate new human motions, thereby overcoming mode collapse and mode mixture. MOOT produces a well-behaved, highly structured latent code distribution, providing a strong motion prior for various human motion applications. In qualitative and quantitative experiments, MOOT achieves state-of-the-art results and surpasses the latest methods in unconditional human motion generation.
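
To make the convex-optimization step more concrete, the following is a minimal sketch of a standard semi-discrete OT solver, not the authors' implementation. It assumes the latent codes are given as a NumPy array, the noise is standard Gaussian, and every latent code carries equal target mass; the function names `assign_cells`, `fit_semi_discrete_ot`, and `transport` are hypothetical. The solver runs Monte Carlo gradient descent on the convex energy whose gradient is (estimated cell mass minus target mass). The resulting map is piecewise constant and therefore discontinuous across cell boundaries; the "extended" OT map mentioned in the abstract is not reproduced here.

```python
import numpy as np

def assign_cells(x, latents, h):
    """Index of the latent code each noise sample is sent to.

    Sample x goes to the code i maximizing <x, y_i> + h_i, i.e. the
    gradient of the piecewise-linear Brenier potential max_i(<x, y_i> + h_i).
    """
    return np.argmax(x @ latents.T + h, axis=1)

def fit_semi_discrete_ot(latents, n_iters=400, batch=4096, lr=0.5, seed=0):
    """Estimate the heights h of the semi-discrete OT map (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = latents.shape
    target = np.full(n, 1.0 / n)            # assumption: uniform target mass
    h = np.zeros(n)
    for _ in range(n_iters):
        x = rng.standard_normal((batch, d))  # assumption: Gaussian noise
        idx = assign_cells(x, latents, h)
        mass = np.bincount(idx, minlength=n) / batch  # Monte Carlo cell masses
        h -= lr * (mass - target)            # gradient step on the convex energy
        h -= h.mean()                        # heights are defined up to a constant
    return h

def transport(x, latents, h):
    """Map noise samples to latent codes with the fitted heights."""
    return latents[assign_cells(x, latents, h)]
```

In a pipeline like the one the abstract outlines, the transported codes would then be passed to the generator of the reconstruction network. The learning rate, batch size, and iteration count above are illustrative values, not settings reported for MOOT.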

Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation

T2M-HiFiGPT: Generating High Quality Human Motion from Textual Descriptions with Residual Discrete Representations

T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

MoMask: Generative Masked Modeling of 3D Human Motions

PoseGPT: Quantization-based 3D Human Motion Generation and Forecasting

MMM: Generative Masked Motion Model

Motion Mamba: Efficient and Long Sequence Motion Generation

T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation

Towards Efficient and Diverse Generative Model for Unconditional Human Motion Synthesis

KMM: Key Frame Mask Mamba for Extended Motion Generation

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators

Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions

InfiniMotion: Mamba Boosts Memory in Transformer for Arbitrary Long Motion Generation

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting

Motion Control for Enhanced Complex Action Video Generation

T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance