Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs

Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Yang Wei, Li Yuan
2023-11-02
Abstract: Most text-driven human motion generation methods employ sequential modeling approaches, e.g., transformer, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize the action names at the expense of other important properties and lack fine-grained details to guide the synthesis of subtly distinct motion. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion description and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, including HumanML3D and KIT, with superior performances, justify the efficacy of our method. More encouragingly, by modifying the edge weights of hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-training weights are available at https://github.com/jpthu17/GraphMotion.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main problems in existing text-driven human motion generation methods:

1. **Imbalance**: When extracting text features, existing models tend to overemphasize action names while neglecting other important attributes, such as direction and intensity. This imbalanced learning makes the network insensitive to subtle changes in the input text and limits fine-grained control.
2. **Coarseness**: Existing methods usually generate motion sequences from compact sentence-level representations, so the generated motions lack detail and complex actions cannot be synthesized precisely. Mapping directly from the high-level language space to the motion sequence further hinders the generation of fine-grained details.

To overcome these problems, the authors propose a fine-grained control signal based on hierarchical semantic graphs and design a coarse-to-fine motion diffusion model (GraphMotion). Specifically, they decompose the motion description into three levels of abstract nodes: overall motion, local actions, and action specifics. The model can then generate motion progressively from coarse to fine, which gives more precise control.

### Main Contributions

- **Proposing hierarchical semantic graphs**: A fine-grained control signal that decomposes the motion description into three levels of abstract nodes, from global to local.
- **Designing a coarse-to-fine motion diffusion model**: The model decomposes the text-to-motion diffusion process into three semantic levels, which capture the overall motion, local actions, and action specifics respectively.
- **Continuously refining the generated motion by modifying the edge weights of the hierarchical semantic graph**: Adjusting edge weights allows the generated motion to be refined further, which the authors suggest may have a far-reaching impact on the community.
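
To make the three-level structure more concrete, the following is a minimal sketch of how such a graph could be held in memory. It is not the GraphMotion implementation: the class `HierarchicalSemanticGraph`, its methods, the example sentence, and the weight value are all hypothetical, and the graph is built by hand rather than parsed from text.

```python
from dataclasses import dataclass, field

# Levels ordered from coarse (global) to fine (local), as named in the summary above.
LEVELS = ("motion", "action", "specific")


@dataclass
class HierarchicalSemanticGraph:
    """Toy container for a three-level semantic graph (hypothetical layout)."""

    levels: dict = field(default_factory=dict)  # node text -> level name
    edges: dict = field(default_factory=dict)   # (parent, child) -> weight

    def add_node(self, text: str, level: str) -> None:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.levels[text] = level

    def add_edge(self, parent: str, child: str, weight: float = 1.0) -> None:
        if parent not in self.levels or child not in self.levels:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[(parent, child)] = weight

    def set_weight(self, parent: str, child: str, weight: float) -> None:
        if (parent, child) not in self.edges:
            raise KeyError("edge does not exist")
        self.edges[(parent, child)] = weight

    def nodes_up_to(self, level: str) -> list:
        """Nodes visible at a given stage: this level and every coarser one."""
        cutoff = LEVELS.index(level)
        return [t for t, lv in self.levels.items() if LEVELS.index(lv) <= cutoff]


def build_example_graph() -> HierarchicalSemanticGraph:
    """Hand-built graph for one made-up description (no parser involved)."""
    sentence = "a person walks forward slowly"
    g = HierarchicalSemanticGraph()
    g.add_node(sentence, "motion")
    g.add_node("walks", "action")
    g.add_node("forward", "specific")
    g.add_node("slowly", "specific")
    g.add_edge(sentence, "walks")
    g.add_edge("walks", "forward")
    g.add_edge("walks", "slowly")
    return g


graph = build_example_graph()
for level in LEVELS:
    print(level, "->", graph.nodes_up_to(level))

# Emphasise one attribute by raising its edge weight (illustrative value).
graph.set_weight("walks", "slowly", 2.0)
```

In this sketch, `nodes_up_to` stands in for the idea that each coarse-to-fine stage sees progressively more of the graph, and `set_weight` stands in for the edge-weight adjustment described in the contributions. How the actual model consumes nodes and weights is not specified in this summary.
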
### Experimental Results

The experiments were carried out on two benchmark datasets, HumanML3D and KIT. The results show that GraphMotion achieves new state-of-the-art results on multiple metrics, including R-Precision, FID, MM-Dist, Diversity, and MModality. In addition, by modifying the edge weights of the hierarchical semantic graph, the model can continuously refine the generated motion, which demonstrates its controllability and flexibility.
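
As a rough illustration of one of the listed metrics, R-Precision is commonly reported as a top-k retrieval accuracy: each motion embedding is compared against a pool of text embeddings, and a hit is counted when the matching text ranks among the k nearest. The sketch below assumes Euclidean distance, that matching pairs share the same index, and that the whole list serves as the candidate pool. The function name `r_precision` and the toy 2-D embeddings are hypothetical, and the exact evaluation protocol used in the paper is not described in this summary.

```python
import math


def r_precision(motion_embs, text_embs, top_k=3):
    """Fraction of motions whose matching text (same index) is among
    the top_k nearest text embeddings by Euclidean distance."""
    hits = 0
    for i, motion in enumerate(motion_embs):
        # Distance from this motion to every candidate text embedding.
        dists = [math.dist(motion, text) for text in text_embs]
        # Candidate indices sorted from nearest to farthest.
        ranked = sorted(range(len(text_embs)), key=lambda j: dists[j])
        if i in ranked[:top_k]:
            hits += 1
    return hits / len(motion_embs)


# Toy 2-D embeddings, invented for illustration only.
motions = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
texts = [(0.0, 0.1), (1.0, 1.1), (0.2, 0.0)]

print("top-1:", r_precision(motions, texts, top_k=1))
print("top-2:", r_precision(motions, texts, top_k=2))
```

In the toy data the third motion lies far from its paired text, so it misses at top-1 but is recovered at top-2. Real evaluations use learned motion and text encoders rather than hand-written vectors.
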