Abstract: Most text-driven human motion generation methods employ sequential modeling approaches, e.g., transformer, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize the action names at the expense of other important properties and lack fine-grained details to guide the synthesis of subtly distinct motion. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion description and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, including HumanML3D and KIT, with superior performances, justify the efficacy of our method. More encouragingly, by modifying the edge weights of hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-training weights are available at https://github.com/jpthu17/GraphMotion.

What problem does this paper attempt to address?

This paper addresses two main problems in existing text-driven human motion generation methods:

1. **Imbalance**: When extracting text features, existing models usually overemphasize action names while ignoring other important attributes, such as direction and intensity. This imbalanced learning makes the network insensitive to subtle changes in the input text and deprives it of fine-grained control.
2. **Coarseness**: Existing methods usually generate motion sequences from compact sentence-level representations, so the generated motions lack detail and complex actions cannot be synthesized precisely. Mapping directly from the high-level language space to the motion sequence further hinders the generation of fine-grained details.

To overcome these problems, the authors propose a fine-grained control signal based on a hierarchical semantic graph and design a coarse-to-fine motion diffusion model (GraphMotion). Specifically, they decompose the motion description into three levels of abstract nodes: overall motion, local actions, and action specifics. The model can then generate motions gradually from coarse to fine, which gives more precise control.

### Main Contributions

- **Proposing hierarchical semantic graphs**: A fine-grained control signal that decomposes the motion description into three levels of abstract nodes, from global to local.
- **Designing a coarse-to-fine motion diffusion model**: The model decomposes the text-to-motion diffusion process into three semantic levels, which capture the overall motion, local actions, and action specifics respectively.
- **Continuously refining the generated motion by modifying the edge weights of the hierarchical semantic graph**: Adjusting edge weights allows the generated motion to be refined further, which the authors suggest may have a far-reaching impact on the community.

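
The three-level decomposition described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class `SemanticGraph`, the helper `build_example`, the node ids, and the example sentence are all hypothetical, and the graph is built by hand rather than parsed from text.

```python
from dataclasses import dataclass, field

# Levels ordered coarse to fine, following the three levels named in the summary.
LEVELS = ("motion", "action", "specific")


@dataclass
class SemanticGraph:
    """Hypothetical container for a hierarchical semantic graph."""
    nodes: dict = field(default_factory=dict)  # node id -> (level, text)
    edges: dict = field(default_factory=dict)  # (parent id, child id) -> weight

    def add_node(self, node_id, level, text):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.nodes[node_id] = (level, text)

    def add_edge(self, parent, child, weight=1.0):
        if parent not in self.nodes or child not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(parent, child)] = weight

    def by_level(self, level):
        # Node ids at one level, in insertion order.
        return [nid for nid, (lvl, _) in self.nodes.items() if lvl == level]


def build_example():
    # Hand-built graph for an illustrative description (an assumption,
    # not taken from the paper or its datasets).
    g = SemanticGraph()
    g.add_node("m0", "motion", "a person walks forward slowly then sits down")
    g.add_node("a0", "action", "walks")
    g.add_node("a1", "action", "sits")
    g.add_node("s0", "specific", "forward")
    g.add_node("s1", "specific", "slowly")
    g.add_node("s2", "specific", "down")
    g.add_edge("m0", "a0")
    g.add_edge("m0", "a1")
    g.add_edge("a0", "s0")
    g.add_edge("a0", "s1")
    g.add_edge("a1", "s2")
    return g


graph = build_example()
for level in LEVELS:
    # Visit levels coarse to fine, mirroring the order of the three diffusion stages.
    print(level, [graph.nodes[n][1] for n in graph.by_level(level)])
```

Iterating over `LEVELS` in order mirrors the coarse-to-fine idea: the overall motion node is visited first, then the action nodes, then the specifics attached to each action.
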
### Experimental Results

The experiments were carried out on two benchmark datasets: HumanML3D and KIT. The results show that GraphMotion achieves new state-of-the-art results on multiple metrics, including R-Precision, FID, MM-Dist, Diversity, and MModality. In addition, by modifying the edge weights of the hierarchical semantic graph, the model can continuously refine the generated motion, demonstrating its strong controllability and flexibility.
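
To give a feel for why changing an edge weight can shift the conditioning signal, here is a toy sketch. It assumes a plain normalized weighted sum over child node vectors, which is a stand-in for illustration and not the graph reasoning used in the paper; the function `aggregate` and the 2-D vectors are hypothetical.

```python
def aggregate(parent_vec, children, weights):
    """Add a weight-normalized average of child vectors to the parent vector."""
    total = sum(weights)
    out = list(parent_vec)
    if total == 0:
        return out  # no child contributes
    for vec, w in zip(children, weights):
        for i, v in enumerate(vec):
            out[i] += (w / total) * v
    return out


# Toy 2-D embeddings (made up for illustration).
walks = [1.0, 0.0]
forward = [0.0, 1.0]
slowly = [0.0, -1.0]

# Equal edge weights: the two specifics cancel in the second coordinate.
baseline = aggregate(walks, [forward, slowly], [1.0, 1.0])

# Raising the weight on the "slowly" edge pulls the result toward that node.
emphasised = aggregate(walks, [forward, slowly], [1.0, 3.0])

print(baseline, emphasised)
```

Because the weights are normalized, only their ratio matters, so an edge weight can be varied continuously to move the aggregated vector smoothly between the two specifics.
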

Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

Motion Generation from Fine-grained Textual Descriptions

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling

Understanding Text-driven Motion Synthesis with Keyframe Collaboration via Diffusion Models

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Neural Motion Graph

Rethinking Diffusion for Text-Driven Human Motion Generation

Human Motion Diffusion as a Generative Prior

FG-MDM: Towards Zero-Shot Human Motion Generation via ChatGPT-Refined Descriptions

Realistic Human Motion Generation with Cross-Diffusion Models

Searching Motion Graphs for Human Motion Synthesis

DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control

MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting