Abstract: We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity. Consequently, striking a balance between motion quality and diversity remains an unresolved challenge. This problem is compounded by two key factors: 1) the lack of diversity in motion-caption pairs in existing benchmarks and 2) a one-sided and biased semantic understanding of the text prompt, which focuses primarily on the verb component while neglecting the nuanced distinctions indicated by other words. In response to the first issue, we construct a large-scale Wild Motion-Caption dataset (WMC) to extend the restricted action boundary of existing well-annotated datasets, enabling the learning of diverse motions through a more extensive range of actions. To this end, we train a motion BLIP on top of a pretrained vision-language model and then automatically generate diverse motion captions for the collected motion sequences. The resulting dataset comprises 8,888 motions paired with 141k texts. To understand the text command comprehensively, we propose a Hierarchical Semantic Aggregation (HSA) module that captures fine-grained semantics. Finally, we integrate these two designs into an effective Motion Discrete Diffusion (MDD) framework to strike a balance between motion quality and diversity. Extensive experiments on HumanML3D and KIT-ML show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity. Dataset, code, and pretrained models will be released to reproduce all of our results.

MLUG: Bootstrapping Language-Motion Pre-Training for Unified Motion-Language Understanding and Generation

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

MotionGPT: Human Motion as a Foreign Language

Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Unimotion: Unifying 3D Human Motion Synthesis and Understanding

Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Large Motion Model for Unified Multi-Modal Motion Generation

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model

Unified Generative and Discriminative Training for Multi-modal Large Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

UniMuMo: Unified Text, Music and Motion Generation

Language2Pose: Natural Language Grounded Pose Forecasting