Abstract: We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity. Consequently, striking a balance between motion quality and diversity remains an unresolved challenge. This problem is compounded by two key factors: 1) the lack of diversity in motion-caption pairs in existing benchmarks and 2) the unilateral and biased semantic understanding of the text prompt, which focuses primarily on the verb component while neglecting the nuanced distinctions indicated by other words. In response to the first issue, we construct a large-scale Wild Motion-Caption dataset (WMC) to extend the restricted action boundary of existing well-annotated datasets, enabling the learning of diverse motions through a more extensive range of actions. To this end, a motion BLIP is trained upon a pretrained vision-language model, and we then automatically generate diverse motion captions for the collected motion sequences. As a result, we build a dataset comprising 8,888 motions coupled with 141k texts. In response to the second issue, we propose a Hierarchical Semantic Aggregation (HSA) module that captures fine-grained semantics to comprehensively understand the text command. Finally, we integrate the above two designs into an effective Motion Discrete Diffusion (MDD) framework to strike a balance between motion quality and diversity. Extensive experiments on HumanML3D and KIT-ML show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity. Dataset, code, and pretrained models will be released to reproduce all of our results.

Improving Fine-grained Understanding for Retrieval in Human Motion and Text

Cross-Modal Retrieval for Motion and Text via DropTriple Loss

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

Spatial-Related Sensors Matters: 3D Human Motion Reconstruction Assisted with Textual Semantics

Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

HSMR: A Head-Shoulder Mask Aided ResNet to Guide Focus of Re-Identification Implemented on Tour-Guide Robot

Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion

A 3D Human Motion Refinement Method Based on Sparse Motion Bases Selection

Performance-Driven Motion Retrieval and its Usability Evaluation

Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

Motion Generation from Fine-grained Textual Descriptions

REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction

ImitationNet: Unsupervised Human-to-Robot Motion Retargeting via Shared Latent Space

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

Human MotionFormer: Transferring Human Motions with Vision Transformers

Retrieval-Based Natural 3D Human Motion Generation

Multi-Transmotion: Pre-trained Model for Human Motion Prediction