Abstract: We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity. Consequently, striking a balance between motion quality and diversity remains an unresolved challenge. This problem is compounded by two key factors: 1) the lack of diversity in motion-caption pairs in existing benchmarks and 2) the one-sided and biased semantic understanding of the text prompt, which focuses primarily on the verb component while neglecting the nuanced distinctions indicated by other words. In response to the first issue, we construct a large-scale Wild Motion-Caption dataset (WMC) to extend the restricted action boundary of existing well-annotated datasets, enabling the learning of diverse motions through a more extensive range of actions. To this end, we train a motion BLIP on top of a pretrained vision-language model and use it to automatically generate diverse motion captions for the collected motion sequences. The resulting dataset comprises 8,888 motions paired with 141k texts. In response to the second issue, we propose a Hierarchical Semantic Aggregation (HSA) module that captures fine-grained semantics so that the text command is understood comprehensively. Finally, we integrate these two designs into an effective Motion Discrete Diffusion (MDD) framework to strike a balance between motion quality and diversity. Extensive experiments on HumanML3D and KIT-ML show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity. Dataset, code, and pretrained models will be released to reproduce all of our results.

CLaM: an Open-Source Library for Performance Evaluation of Text-driven Human Motion Generation

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

CoMA: Compositional Human Motion Generation with Multi-modal Agents

Motion Generation from Fine-grained Textual Descriptions

Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation

Contact-aware Human Motion Generation from Textual Descriptions

Learning Generalizable Human Motion Generator with Reinforcement Learning

HumanTOMATO: Text-aligned Whole-body Motion Generation

MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

T3M: Text Guided 3D Human Motion Synthesis from Speech

Generating Human Interaction Motions in Scenes with Text Control

MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Generating Human Motion in 3D Scenes from Text Descriptions

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

OmniMotionGPT: Animal Motion Generation with Limited Data

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

MotionCLIP: Exposing Human Motion Generation to CLIP Space

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs