Abstract: Text-to-motion generation is a crucial task in computer vision that generates a target 3D motion from a given text. Existing annotated datasets are limited in scale, so most existing methods overfit to these small datasets and fail to generalize to open-domain motions. Some methods attempt to solve the open-vocabulary motion generation problem by aligning to the CLIP space or by using the pretrain-then-finetune paradigm. However, the limited scale of current annotated datasets only allows them to learn a mapping from a sub-text-space to a sub-motion-space, rather than a mapping between the full text space and the full motion space (full mapping), which is the key to open-vocabulary motion generation. To this end, this paper proposes to use atomic motions (simple body-part motions over a short time period) as an intermediate representation, and to address the full mapping problem with two sequentially coupled steps: Textual Decomposition and Sub-motion-space Scattering. For Textual Decomposition, we design a fine-grained description conversion algorithm and combine it with the generalization ability of a large language model to convert any given motion text into atomic texts. Sub-motion-space Scattering learns the compositional process from atomic motions to target motions, so that the learned sub-motion-space is scattered to form the full motion space. For a given open-domain motion, this turns extrapolation into interpolation and thereby significantly improves generalization. Our network, $DSO$-Net, combines textual $d$ecomposition and sub-motion-space $s$cattering to solve $o$pen-vocabulary motion generation. Extensive experiments demonstrate that DSO-Net achieves significant improvements over state-of-the-art methods on open-vocabulary motion generation.
Code is available at https://vankouf.github.io/DSONet/.

Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation

Text-driven Visual Prosody Generation for Embodied Conversational Agents

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Programmable Motion Generation for Open-Set Motion Control Tasks

Textual Decomposition Then Sub-motion-space Scattering for Open-Vocabulary Motion Generation

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

CoMo: Controllable Motion Generation through Language Guided Pose Code Editing

HumanTOMATO: Text-aligned Whole-body Motion Generation

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

Motion Generation from Fine-grained Textual Descriptions

Generating Human Interaction Motions in Scenes with Text Control

Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation

Learning Generalizable Human Motion Generator with Reinforcement Learning

Freeform Body Motion Generation from Speech

GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning