Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Mengyi Shan, Lu Dong, Yutao Han, Yuan Yao, Tao Liu, Ifeoma Nwogu, Guo-Jun Qi, Mitch Hill
2024-07-15
Abstract: This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one or two subjects from in-the-wild prompts, mainly due to the lack of available datasets. In this work, we curate human pose and motion datasets by estimating pose information from large-scale image and video datasets. Our models use a transformer-based diffusion framework that accommodates multiple datasets with any number of subjects or frames. Experiments explore both generation of multi-person static poses and generation of multi-person motion sequences. To our knowledge, our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of generating natural and diverse motion sequences for multiple people from text descriptions. Although text-to-motion generation for one or two people has been widely studied, multi-person motion generation remains challenging, mainly because of the lack of available datasets. The paper tackles this by curating large-scale datasets of multi-person poses and motions and by proposing a Transformer-based diffusion framework that can handle any number of subjects or frames. Experiments cover both the generation of multi-person static poses and the generation of multi-person motion sequences. To the authors' knowledge, this is the first method that can generate multi-subject motion sequences with high diversity and fidelity from a large variety of text prompts.

### Main problems solved by the paper:

1. **Insufficient datasets**: Existing research on multi-person motion generation is limited by the lack of available datasets, especially datasets with more than two subjects.
2. **Complexity of multi-person motion generation**: Compared with single- or two-person generation, multi-person generation must handle more interactions and coordination between subjects, which makes generation harder.
3. **Open-domain text-driven generation**: Existing methods are usually limited to specific types of text prompts, whereas this paper aims to generate multi-person motions from open-domain text descriptions, which implies more diverse and more complex prompts.

### Solutions:

1. **Dataset construction**:
   - **LAION-Pose**: Multi-person poses and text descriptions extracted from a large-scale image dataset, containing 8 million (image, pose, text) tuples.
   - **WebVid-Motion**: Multi-person motions and text descriptions extracted from a large-scale video dataset, containing 3,500 (video, motion, text) tuples.
2. **Model design**:
   - **Transformer-based diffusion framework**: A Transformer-based diffusion framework that can train on multiple data sources (single- or multi-person, single- or multi-frame) and generate plausible multi-person poses and motion sequences.
   - **Two-stage generation**:
     - **First stage**: Generate a single frame containing multi-person poses.
     - **Second stage**: Generate a complete multi-person motion sequence conditioned on the single-frame poses from the first stage.
3. **Evaluation methods**:
   - **Decomposed evaluation**: Because real-world data for multi-person motions is lacking, the authors evaluate two things separately: the multi-person pose result of each single frame, and the single-person motion result of each single subject.
   - **Feature encoders**: Two feature encoders are trained for evaluation, one on text-pose pairs and the other on text-motion pairs.

### Main contributions:

1. **The first method to generate multi-person motion sequences from open-domain text descriptions.**
2. **Large-scale multi-person pose and motion datasets** (LAION-Pose and WebVid-Motion).
3. **A decomposed evaluation method** for assessing generated results in the absence of real-world multi-person motion data.

Together, these contributions address the challenges of generating multi-person motion sequences from open-domain text descriptions and provide new datasets and evaluation criteria for future research.
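
The decomposed evaluation summarized above splits one generated multi-person motion into two views that are scored separately: the multi-person pose in each frame, and the single-person motion of each subject. The following is a minimal sketch of that split only, not the authors' implementation. The nested-list layout `motion[subject][frame]`, the function name `split_for_decomposed_eval`, and the toy data are all assumptions made for illustration; the scoring step with the trained feature encoders is not shown.

```python
from typing import List, Tuple

# Hypothetical layout (an assumption, not from the paper):
# motion[subject][frame] is one subject's pose in one frame,
# stored here as a flat list of joint coordinates.
Pose = List[float]
Motion = List[List[Pose]]


def split_for_decomposed_eval(motion: Motion) -> Tuple[list, list]:
    """Split a multi-person motion into the two views scored separately.

    Returns:
        frame_poses: one entry per frame, each holding every subject's pose
            in that frame (multi-person, single-frame view).
        subject_motions: one entry per subject, each holding that subject's
            poses over all frames (single-person, multi-frame view).
    """
    if not motion:
        return [], []
    num_frames = len(motion[0])
    if any(len(track) != num_frames for track in motion):
        raise ValueError("all subjects must have the same number of frames")

    # Transpose from subject-major to frame-major order.
    frame_poses = [[track[t] for track in motion] for t in range(num_frames)]
    # Each subject's track is already a single-person motion; copy it.
    subject_motions = [list(track) for track in motion]
    return frame_poses, subject_motions


# Toy input: 3 subjects, 2 frames, one 2-D point per pose.
toy = [
    [[0.0, 0.0], [0.1, 0.0]],
    [[1.0, 0.0], [1.1, 0.0]],
    [[2.0, 0.0], [2.1, 0.0]],
]
frames, subjects = split_for_decomposed_eval(toy)
print(len(frames), len(subjects))  # number of frames, number of subjects
```

In such a setup, each element of `frames` would be passed to a text-pose encoder and each element of `subjects` to a text-motion encoder, so neither metric needs ground-truth multi-person motion sequences.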