Abstract: Current methods for generating human motion videos rely on extracting pose sequences from reference videos, which restricts flexibility and control. Additionally, due to the limitations of pose detection techniques, the extracted pose sequences can sometimes be inaccurate, leading to low-quality video outputs. We introduce a novel task aimed at generating human motion videos solely from reference images and natural language. This approach offers greater flexibility and ease of use, as text is more accessible than the desired guidance videos. However, training an end-to-end model for this task requires millions of high-quality text and human motion video pairs, which are challenging to obtain. To address this, we propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models. This approach is not straightforward, as the text-generated skeletons may not consistently match the scale of the reference image and may lack detailed information. To overcome these challenges, we introduce an anchor point based rescale method and design a skeleton adapter to fill in missing details and bridge the gap between text-to-motion and motion-to-video generation. We also propose a video refinement process to further enhance video quality. A large language model (LLM) is employed to decompose natural language into discrete motion sequences, enabling the generation of motion videos of any desired length. To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions. We also propose a new metric, MotionScore, to evaluate the accuracy of motion following. Both qualitative and quantitative results demonstrate that our method outperforms existing text-conditioned image-to-video generation methods. All code and model weights will be made publicly available.
What problem does this paper attempt to address?
The paper addresses a limitation of existing methods for generating human motion videos: they rely on pose sequences extracted from reference videos, which restricts flexibility and control. Because pose detection is imperfect, the extracted pose sequences are sometimes inaccurate, which leads to low-quality video output. The paper therefore proposes a new task: generating human motion videos from text and a reference image alone. This is more flexible and easier to use, because text is easier to obtain than a suitable guidance video. However, training an end-to-end model for this task would require millions of high-quality text and human motion video pairs, which are difficult to obtain in practice.
To solve these problems, the paper proposes a new framework called **Fleximo**, which utilizes large-scale pre-trained text-to-3D-motion models. Specifically, Fleximo achieves its goals through the following steps:
1. **Text Parsing and 3D Skeleton Generation**: Use a large language model (such as LLaMA-7B) to parse the input natural language into discrete motion sequences. Then, use a text-to-3D-motion module (such as T2M-GPT) to generate the corresponding 3D skeleton vertices.
2. **3D Skeleton Projection and Scale Adjustment**: Project the generated 3D skeleton vertices onto 2D space, and use an anchor-based rescaling method to adjust the scale of the 2D skeleton to match the reference image.
3. **Skeleton Adapter**: Design a skeleton adapter to fill in missing details (such as hand and face information), thereby generating a complete 2D skeleton video.
4. **Video Generation and Refinement**: Use the generated 2D skeleton video and the reference image as guidance to generate an initial human motion video, then improve its quality through a video refinement process.
5. **Long-Text Processing**: Use a large language model (LLM) to split long text input into discrete motion segments, enabling the framework to generate motion videos of any length.
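Step 2 above can be illustrated with a minimal sketch. The summary does not specify the projection model or which joints serve as anchors, so everything here is an assumption: an orthographic projection that drops the depth axis, two anchor joints (for example neck and hip), and the hypothetical helper names `project_orthographic` and `rescale_to_reference`. It is not the authors' implementation.

```python
from math import hypot


def project_orthographic(joints_3d):
    """Drop the depth axis: (x, y, z) -> (x, y). Assumes an orthographic camera."""
    return [(x, y) for x, y, _z in joints_3d]


def rescale_to_reference(skeleton_2d, idx_a, idx_b, ref_anchor_a, ref_anchor_b):
    """Scale and translate a 2D skeleton so its anchor joints match the reference.

    The scale is the ratio of anchor-to-anchor distances (reference / skeleton);
    the translation pins the skeleton's anchor A onto the reference's anchor A.
    """
    ax, ay = skeleton_2d[idx_a]
    bx, by = skeleton_2d[idx_b]
    src_len = hypot(bx - ax, by - ay)
    if src_len == 0:
        raise ValueError("anchor joints coincide; scale is undefined")
    ref_len = hypot(ref_anchor_b[0] - ref_anchor_a[0],
                    ref_anchor_b[1] - ref_anchor_a[1])
    scale = ref_len / src_len
    return [
        (ref_anchor_a[0] + scale * (x - ax), ref_anchor_a[1] + scale * (y - ay))
        for x, y in skeleton_2d
    ]


# Toy example: three joints; index 0 and 1 play the role of the anchors.
joints_3d = [(0.0, 0.0, 5.0), (0.0, 2.0, 5.0), (1.0, 1.0, 5.0)]
skeleton_2d = project_orthographic(joints_3d)
# Anchor positions as they might be detected in the reference image (pixels).
rescaled = rescale_to_reference(skeleton_2d, 0, 1, (100.0, 50.0), (100.0, 250.0))
print(rescaled)
```

For a full motion sequence, one would probably estimate the scale and offset once, for example from the first frame, and reuse them for every frame so that the skeleton does not jitter in size; that choice is also an assumption rather than something stated in the source.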
To evaluate the performance of Fleximo, the paper also introduces a new benchmark dataset **MotionBench**, which contains 400 videos, covering 20 different identities and 20 different actions. At the same time, a new evaluation metric **MotionScore** is proposed to evaluate the consistency between the generated video and the input text.
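The benchmark size is consistent with pairing every identity with every motion (20 × 20 = 400). The source does not state this pairing explicitly, so the sketch below treats it as an assumption; the identity and motion names are placeholders, not entries from MotionBench.

```python
from itertools import product

# Placeholder labels; the real benchmark's identities and motions are not listed here.
identities = [f"identity_{i:02d}" for i in range(20)]
motions = [f"motion_{j:02d}" for j in range(20)]

# Assumed layout: one video per (identity, motion) pair.
benchmark_pairs = list(product(identities, motions))
print(len(benchmark_pairs))
```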
In general, the main contributions of this paper include:
- Introducing a new task, that is, generating high-quality human motion videos based on text and reference images, providing a more flexible and user-friendly method.
- Proposing a new framework **Fleximo**, which avoids the need for a large amount of text-video paired data, utilizes large-scale pre-trained text-to-3D-motion models, and bridges the gap between text-to-motion and motion-to-video generation models.
- Introducing the **MotionBench** benchmark dataset and the **MotionScore** metric to evaluate the quality and consistency of the generated motion videos.