StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework

Yiheng Huang, Hui Yang, Chuanchen Luo, Yuxi Wang, Shibiao Xu, Zhaoxiang Zhang, Man Zhang, Junran Peng
2024-05-09
Abstract: Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies. The effect of the design of each component is still unclear. In addition, the iterative denoising process consumes considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. For this reason, we first conduct a comprehensive investigation into network architectures, training strategies, and inference processes. Based on this analysis, we tailor each component for efficient high-quality human motion generation. Despite the promising performance, the tailored model still suffers from foot skating, which is a ubiquitous issue in diffusion-based solutions. To eliminate footskate, we identify foot-ground contact and correct foot motions along the denoising process. By combining these well-designed components, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that our StableMoFusion performs favorably against current state-of-the-art methods. Project page:
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses several key issues in diffusion-based human motion generation and proposes a new framework, StableMoFusion, to resolve them. Specifically, it targets three main problems:

1. **Lack of Systematic Analysis**: Existing diffusion-based motion generation methods adopt different network architectures and training strategies, which hinders integration across methods and the adoption of advances from related fields.
2. **Long Inference Time**: Because the iterative sampling process is time-consuming, most existing methods are impractical in scenarios that require real-time response, such as virtual characters and humanoid robots.
3. **Footskate Issue**: Footskate in generated motions is a common problem that severely degrades motion quality and limits practical applications.

To overcome these challenges, the researchers conducted a comprehensive investigation of network architectures, training strategies, and inference processes. Based on this analysis, they proposed the StableMoFusion framework, which has the following features:

- **Efficient Network Architecture**: An optimized Conv1D UNet is used as the denoising network, with Adaptive Group Normalization (AdaGN) and linear cross-attention.
- **Effective Training Strategy**: Exponential Moving Average (EMA) is used to smooth the changes in model parameters, and Classifier-Free Guidance (CFG) is employed to balance the consistency and fidelity between text and motion.
- **Accelerated Inference Process**: Techniques such as efficient samplers, embedded text caching, parallel CFG computation, and low-precision inference significantly improve inference speed.
- **Solution to the Footskate Issue**: A method based on mechanical models and optimization is proposed to identify and correct footskate in motion sequences.

Experimental results show that StableMoFusion outperforms the current state-of-the-art methods in terms of motion quality, text consistency, and inference efficiency.
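To make the EMA, CFG, and parallel CFG computation features above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`ema_update`, `cfg_combine`, `guided_prediction`, `toy_denoiser`), the default decay of 0.999, and the use of `None` as the empty text condition are all assumptions made for illustration. The guidance formula is the commonly used CFG form, and the paper's exact formulation may differ.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def ema_update(ema_params: Sequence[float],
               model_params: Sequence[float],
               decay: float = 0.999) -> Vector:
    """Return EMA weights after one training step (decay value is an assumption)."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, model_params)]


def cfg_combine(eps_uncond: Sequence[float],
                eps_cond: Sequence[float],
                guidance_scale: float) -> Vector:
    """Common CFG form: move from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]


def guided_prediction(denoiser: Callable[[list, list], List[Vector]],
                      x: Vector,
                      text_embedding: Vector,
                      guidance_scale: float) -> Vector:
    """Evaluate both CFG branches in one batched denoiser call instead of two."""
    batch_inputs = [x, x]                       # same noisy sample twice
    batch_conditions = [None, text_embedding]   # None stands for "no text" (assumption)
    eps_uncond, eps_cond = denoiser(batch_inputs, batch_conditions)
    return cfg_combine(eps_uncond, eps_cond, guidance_scale)


def toy_denoiser(batch_x: List[Vector],
                 batch_cond: List[Optional[Vector]]) -> List[Vector]:
    """Hypothetical stand-in for the denoising network, used only to run the sketch."""
    outputs = []
    for x, cond in zip(batch_x, batch_cond):
        shift = 0.0 if cond is None else sum(cond)
        outputs.append([0.5 * v + shift for v in x])
    return outputs
```

In this sketch, calling `guided_prediction(toy_denoiser, x, text_embedding, guidance_scale)` runs the conditional and unconditional branches as a single batch of two. With a real network on a GPU, that replaces two sequential forward passes per denoising step with one. A guidance scale of 1.0 reproduces the conditional prediction, and larger values push the result further toward the text condition.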