Abstract: We study conditional human motion generation, a challenging task that produces plausible human motion sequences from various conditional inputs, such as action classes or textual descriptors. Because human motions are highly diverse and follow a distribution quite different from that of conditional modalities, such as textual descriptors in natural language, it is hard to learn a probabilistic mapping from the desired conditional modality to human motion sequences. Moreover, raw motion data from a motion capture system can be redundant across the sequence and contain noise; directly modeling the joint distribution over raw motion sequences and conditional modalities would incur heavy computational overhead and might introduce artifacts caused by the captured noise. To learn a better representation of diverse human motion sequences, we first design a powerful Variational AutoEncoder (VAE) that yields a representative, low-dimensional latent code for each human motion sequence. Then, instead of using a diffusion model to connect the raw motion sequences with the conditional inputs, we perform the diffusion process in the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) produces vivid motion sequences that conform to the given conditional inputs and substantially reduces the computational overhead in both the training and inference stages. Extensive experiments on various human motion generation tasks demonstrate that MLD achieves significant improvements over state-of-the-art methods while being two orders of magnitude faster than previous diffusion models that operate on raw motion sequences.
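
To make the latent-space idea concrete, the following is a minimal sketch and not the authors' implementation. It compresses a toy motion sequence into a short latent code and applies the standard forward noising step of a diffusion model to that code instead of to the raw frames. Everything named here is an assumption for illustration: `toy_encode` is a hypothetical stand-in for a trained VAE encoder, the linear beta schedule and its endpoints are generic choices, and the sizes (60 frames, 32 features, 8 latent dimensions) are arbitrary. The learned denoiser and the conditioning on text or action labels are omitted.

```python
import math
import random


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def toy_encode(motion, latent_dim):
    """Hypothetical stand-in for a VAE encoder.

    Averages each feature over all frames and keeps the first `latent_dim`
    features. A real encoder would be a trained network.
    """
    num_frames = len(motion)
    return [sum(frame[j] for frame in motion) / num_frames
            for j in range(latent_dim)]


def q_sample(z0, t, alpha_bar, rng):
    """Forward noising: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    zt = [signal * z + noise * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
num_frames, feat_dim, latent_dim = 60, 32, 8  # illustrative sizes only

# Toy "raw motion": one feature vector per frame.
motion = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
          for _ in range(num_frames)]

z0 = toy_encode(motion, latent_dim)          # compact latent code
alpha_bar = make_alpha_bar(1000)
zt, eps = q_sample(z0, 500, alpha_bar, rng)  # noise the latent, not the frames

raw_size = num_frames * feat_dim
print("values per raw sequence:", raw_size)
print("values per latent code:", len(zt))
```

In this sketch each diffusion step touches only `latent_dim` values rather than `num_frames * feat_dim`, which illustrates where the reduced overhead described in the abstract would come from. The actual speedup depends on the real encoder and denoiser, neither of which is modeled here.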

Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer

Executing Your Commands Via Motion Diffusion in Latent Space

InterGen: Diffusion-Based Multi-human Motion Generation Under Complex Interactions

It Takes Two: Real-time Co-Speech Two-person's Interaction Generation via Reactive Auto-regressive Diffusion Model

Fg-T2M: Fine-Grained Text-Driven Human Motion Generation Via Diffusion Model

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

Multi-person/Group Interactive Video Generation

Efficient Text-driven Motion Generation via Latent Consistency Training

M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Human Motion Diffusion as a Generative Prior

FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis

Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis

Human Motion Diffusion Model

Rethinking Diffusion for Text-Driven Human Motion Generation

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Move-in-2D: 2D-Conditioned Human Motion Generation

T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data