RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse

Zhouyingcheng Liao, Mingyuan Zhang, Wenjia Wang, Lei Yang, Taku Komura
2024-12-06
Abstract: While motion generation has made substantial progress, its practical application remains constrained by dataset diversity and scale, limiting its ability to handle out-of-distribution scenarios. To address this, we propose a simple and effective baseline, RMD, which enhances the generalization of motion generation through retrieval-augmented techniques. Unlike previous retrieval-based methods, RMD requires no additional training and offers three key advantages: (1) the external retrieval database can be flexibly replaced; (2) body parts from the motion database can be reused, with an LLM facilitating splitting and recombination; and (3) a pre-trained motion diffusion model serves as a prior to improve the quality of motions obtained through retrieval and direct combination. Without any training, RMD achieves state-of-the-art performance, with notable advantages on out-of-distribution data.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics
### What problem does this paper attempt to solve?

This paper addresses the generalization problem in **human motion generation**, in particular **out-of-distribution (OOD) motion generation**. Existing methods perform poorly on OOD text inputs for two main reasons:

1. **Combinatorial complexity**: Human motions combine in so many ways that a training set cannot cover every possible full-body motion.
2. **Description diversity**: The same motion can be described in many different ways, which leaves a persistent gap between test and training prompts.

To address these problems, the authors propose a simple and effective baseline, **RMD (Retrieval-augmented Motion Diffuse)**. It improves the generalization of motion generation in the following ways:

- **No additional training required**: RMD is a training-free, retrieval-augmented method.
- **Hierarchical retrieval strategy**: Retrieval operates at several levels (full-body, half-body, fine-grained) to bridge the gaps described above.
- **Pretrained diffusion model**: A pretrained diffusion model refines the synthesized motions, improving motion coordination and generation diversity.

### Method overview

The RMD workflow has two main stages:

1. **Motion retrieval stage**:
   - Use a customizable external motion database to select and combine relevant motions according to the user's text prompt.
   - Retrieve motion segments from the database through a multi-level pipeline (full-body, half-body, fine-grained) and recombine them into complete motions.
2. **Motion diffusion stage**:
   - Use a pretrained motion diffusion model to refine the combined motions, improving their quality and diversity.
   - Apply a noise-and-denoise scheme: first add noise to the retrieved motions, then use the pretrained model to remove the noise under the guidance of the input text.

### Experimental results

Experiments show that RMD achieves state-of-the-art performance on both the standard benchmark HumanML3D and the cross-domain dataset Mixamo, especially on OOD data. User studies also show that RMD has significant advantages on OOD text prompts, generating more realistic motions that better match the text descriptions.

### Summary

RMD provides a lightweight, easy-to-implement, and high-performance method, systematically evaluates the performance of existing methods in OOD scenarios, and sets a new benchmark for future general-purpose motion generation.
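
To make the two stages from the method overview concrete, the following is a minimal sketch and not the authors' implementation. It makes several assumptions that are not in the source: motions are stored as a dict from body part to a flat list of floats, text matching uses a toy word-overlap score in place of a learned retriever, the per-part sub-prompts (which the paper obtains from an LLM) are passed in by the caller, the noising step uses the standard DDPM forward formula, and `denoise_step` is a hypothetical stand-in for the pretrained text-conditioned diffusion model. All function names, the threshold, and the noise schedule are illustrative.

```python
import math
import random


def jaccard(a, b):
    """Word-overlap similarity; a toy stand-in for a learned text retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0


def retrieve(prompt, database, threshold):
    """Return the best (caption, motion) entry, or None if it scores below threshold."""
    best = max(database, key=lambda entry: jaccard(prompt, entry[0]), default=None)
    if best is None or jaccard(prompt, best[0]) < threshold:
        return None
    return best


def hierarchical_retrieve(prompt, part_prompts, database, threshold=0.5):
    """Try a full-body match first; otherwise recombine per-part matches.

    part_prompts maps a body part (e.g. "upper") to a sub-prompt. The split
    itself is assumed to come from elsewhere (an LLM in the paper).
    """
    if not database:
        raise ValueError("retrieval database is empty")
    hit = retrieve(prompt, database, threshold)
    if hit is not None:
        return hit[1]
    combined = {}
    for part, sub_prompt in part_prompts.items():
        # Threshold 0.0 always accepts the best available entry for this part.
        part_hit = retrieve(sub_prompt, database, threshold=0.0)
        combined[part] = list(part_hit[1][part])
    return combined


def add_noise(motion, alpha_bar_t, rng):
    """Standard DDPM forward step: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps."""
    scale, sigma = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return {
        part: [scale * v + sigma * rng.gauss(0.0, 1.0) for v in values]
        for part, values in motion.items()
    }


def refine(motion, prompt, alpha_bars, t_start, denoise_step, rng):
    """Noise the retrieved motion to step t_start, then denoise back to step 0.

    alpha_bars[t] is the cumulative signal level at step t, with alpha_bars[0] == 1.0.
    denoise_step(x, t, prompt) is a placeholder for one reverse step of a
    pretrained text-conditioned model.
    """
    x = add_noise(motion, alpha_bars[t_start], rng)
    for t in range(t_start, 0, -1):
        x = denoise_step(x, t, prompt)
    return x
```

In this sketch, a prompt with no good full-body match falls back to stitching together the best upper-body and lower-body entries, and `refine` then perturbs that stitched motion before handing it to the denoiser. A larger `t_start` gives the diffusion prior more freedom to repair seams between recombined parts, while a smaller one stays closer to the retrieved motion; how that trade-off is set in RMD is not stated in this summary.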