Abstract: Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies, and the effect of each component's design remains unclear. In addition, the iterative denoising process incurs considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. For this reason, we first conduct a comprehensive investigation into network architectures, training strategies, and inference processes. Based on this analysis, we tailor each component for efficient, high-quality human motion generation. Despite its promising performance, the tailored model still suffers from foot skating, which is a ubiquitous issue in diffusion-based solutions. To eliminate footskate, we identify foot-ground contact and correct foot motions along the denoising process. By combining these well-designed components, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that StableMoFusion performs favorably against current state-of-the-art methods. Project page:

What problem does this paper attempt to address?

The paper addresses several key issues in diffusion-based human motion generation and proposes a new framework, StableMoFusion, to resolve them. Specifically, it targets three main problems:

1. **Lack of Systematic Analysis**: Existing diffusion-based motion generation methods adopt different network architectures and training strategies, which hinders integration across methods and the adoption of advances from related fields.
2. **Long Inference Time**: Because iterative sampling is time-consuming, most existing methods are impractical in scenarios that require real-time response, such as virtual characters and humanoid robots.
3. **Footskate Issue**: Footskate is a common artifact that severely degrades the quality of generated motions and limits their practical use.

To overcome these challenges, the authors conducted a comprehensive investigation covering network architectures, training strategies, and inference processes. Based on this analysis, they proposed the StableMoFusion framework, which has the following features:

- **Efficient Network Architecture**: An optimized Conv1D UNet is used as the denoising network, with Adaptive Group Normalization (AdaGN) and linear cross-attention.
- **Effective Training Strategy**: Exponential Moving Average (EMA) is used to smooth changes in model parameters, and Classifier-Free Guidance (CFG) is employed to balance text-motion consistency against motion fidelity.
- **Accelerated Inference Process**: An efficient sampler, embedded-text caching, parallel CFG computation, and low-precision inference significantly improve inference speed.
- **Solution to the Footskate Issue**: A method based on a mechanical model and optimization is proposed to identify and correct footskate in motion sequences.

Experimental results show that StableMoFusion outperforms current state-of-the-art methods in motion quality, text consistency, and inference efficiency.
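
The EMA and CFG items above can be illustrated with a minimal sketch. It uses the commonly cited formulations (an EMA of the weights, and guidance computed as the unconditional prediction plus a scaled difference to the conditional one), not code from the paper. All names here (`ema_update`, `cfg_combine`, `guided_prediction`, `toy_denoiser`) are hypothetical, and plain Python lists stand in for tensors. The batched call is one plausible reading of "parallel CFG computation": both branches are evaluated in a single forward pass instead of two.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical batched denoiser: takes a batch of noisy inputs and a batch of
# conditions (None means "unconditional") and returns one prediction per item.
BatchDenoiser = Callable[[List[Vector], List[Optional[Vector]]], List[Vector]]


def ema_update(ema: Vector, params: Sequence[float], decay: float = 0.999) -> Vector:
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


def cfg_combine(eps_cond: Vector, eps_uncond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def guided_prediction(
    denoise_batch: BatchDenoiser, x_t: Vector, text_emb: Vector, scale: float
) -> Vector:
    """Evaluate the conditional and unconditional branches in one batched call."""
    eps_cond, eps_uncond = denoise_batch([x_t, x_t], [text_emb, None])
    return cfg_combine(eps_cond, eps_uncond, scale)


def toy_denoiser(xs: List[Vector], conds: List[Optional[Vector]]) -> List[Vector]:
    """Stand-in model (assumption): shifts the input by the first condition value."""
    out = []
    for x, cond in zip(xs, conds):
        shift = 0.0 if cond is None else cond[0]
        out.append([v + shift for v in x])
    return out


if __name__ == "__main__":
    # The text embedding is passed in already computed, so it can be cached
    # once and reused at every denoising step.
    print(guided_prediction(toy_denoiser, [1.0, 2.0], [0.5], scale=2.0))
    print(ema_update([0.0, 1.0], [1.0, 1.0], decay=0.5))
```

With `scale = 1` the guided result equals the conditional prediction, and with `scale = 0` it equals the unconditional one; larger values trade diversity for closer adherence to the text.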

StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework
