Abstract: This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one or two subjects from in-the-wild prompts, mainly due to the lack of available datasets. In this work, we curate human pose and motion datasets by estimating pose information from large-scale image and video datasets. Our models use a transformer-based diffusion framework that accommodates multiple datasets with any number of subjects or frames. Experiments explore both generation of multi-person static poses and generation of multi-person motion sequences. To our knowledge, our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.

What problem does this paper attempt to address?

This paper addresses the problem of generating natural and diverse motion sequences for multiple people from text descriptions. Text-to-motion generation for one or two people has been widely studied, but generation for larger groups remains difficult, mainly because suitable datasets are lacking. The paper curates large-scale datasets of multi-person poses and motions, and proposes a Transformer-based diffusion framework that can handle any number of subjects or frames. Experiments cover both multi-person static pose generation and multi-person motion sequence generation. To the authors' knowledge, this is the first method that can generate multi-subject motion sequences with high diversity and fidelity from a wide variety of text prompts.

### Main problems solved by the paper:

1. **Insufficient datasets**: Existing research on multi-person motion generation is limited by the lack of available datasets, especially datasets that support more than one or two subjects.
2. **Complexity of multi-person motion generation**: Compared with single-person or two-person generation, multi-person generation must handle more interactions and coordination between subjects, which makes generation harder.
3. **Open-domain text-driven generation**: Existing methods are usually limited to specific types of text prompts, while this paper aims to generate multi-person motions from open-domain text descriptions, which means more diverse and more complex prompts.

### Solutions:

1. **Dataset construction**:
   - **LAION-Pose**: Multi-person poses and text descriptions extracted from a large-scale image dataset, containing 8 million (image, pose, text) tuples.
   - **WebVid-Motion**: Multi-person motions and text descriptions extracted from a large-scale video dataset, containing 3,500 (video, motion, text) tuples.
2. **Model design**:
   - **Transformer-based diffusion framework**: A Transformer-based diffusion framework that can handle multiple data sources (single-person or multi-person, single-frame or multi-frame) and generate plausible multi-person poses and motion sequences.
   - **Two-stage generation**:
     - **First stage**: Generate a single frame containing multi-person poses.
     - **Second stage**: Generate a complete multi-person motion sequence conditioned on the single-frame poses generated in the first stage.
3. **Evaluation methods**:
   - **Decomposed evaluation**: Because real-world data for multi-person motions is lacking, the authors adopt a decomposed evaluation scheme that separately evaluates the multi-person pose in each frame and the single-person motion of each subject.
   - **Feature encoders**: Two feature encoders are trained for evaluation, one on text-pose pairs and the other on text-motion pairs.

### Main contributions:

1. **The first method to generate multi-person motion sequences from open-domain text descriptions**.
2. **Large-scale multi-person pose and motion datasets** (LAION-Pose and WebVid-Motion).
3. **A decomposed evaluation method** for assessing the quality of generated results in the absence of real-world multi-person motion data.

Together, these contributions address the challenges of generating multi-person motion sequences from open-domain text descriptions and provide new datasets and evaluation criteria for future research.
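The decomposed evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The data layout (`motion[subject][frame]`), the function names, and the two scorer callables are all assumptions made for illustration; in the paper the scores would come from the trained text-pose and text-motion feature encoders, which are not reproduced here.

```python
from statistics import mean
from typing import Callable, List, Tuple

Joint = Tuple[float, float, float]
Pose = List[Joint]            # one subject in one frame
Track = List[Pose]            # one subject across all frames
Motion = List[Track]          # assumed layout: motion[subject][frame]


def decompose_motion(motion: Motion) -> Tuple[List[List[Pose]], List[Track]]:
    """Split a multi-person motion into per-frame and per-subject views."""
    if not motion or not motion[0]:
        raise ValueError("motion must contain at least one subject and one frame")
    num_frames = len(motion[0])
    if any(len(track) != num_frames for track in motion):
        raise ValueError("all subjects must have the same number of frames")
    # View 1: one multi-person static pose per frame.
    frame_poses = [[track[t] for track in motion] for t in range(num_frames)]
    # View 2: one single-person motion per subject.
    subject_motions = [list(track) for track in motion]
    return frame_poses, subject_motions


def decomposed_score(
    motion: Motion,
    pose_scorer: Callable[[List[Pose]], float],    # hypothetical stand-in
    motion_scorer: Callable[[Track], float],       # hypothetical stand-in
) -> Tuple[float, float]:
    """Average a pose score over frames and a motion score over subjects."""
    frame_poses, subject_motions = decompose_motion(motion)
    pose_score = float(mean(pose_scorer(f) for f in frame_poses))
    motion_score = float(mean(motion_scorer(m) for m in subject_motions))
    return pose_score, motion_score


if __name__ == "__main__":
    # Toy input: 3 subjects, 4 frames, 1 joint each.
    toy = [[[(float(p), float(t), 0.0)] for t in range(4)] for p in range(3)]
    # Placeholder scorers that only count elements; real scorers would
    # compare encoder features against the text prompt.
    print(decomposed_score(toy, lambda f: len(f), lambda m: len(m)))
```

The point of the split is that each view can be scored with data that does exist: multi-person static poses per frame, and single-person motions per subject, even when no reference multi-person motion sequences are available.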

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions
