Abstract: We introduce the Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach for human motion generation from textual descriptions of multiple actions, utilizing the strengths of discrete diffusion models. This approach adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions. The strength of M2D2M lies in its dynamic transition probability within the discrete diffusion model, which adapts transition probabilities based on the proximity between motion tokens, encouraging mixing between different modes. Complemented by a two-phase sampling strategy that includes independent and joint denoising steps, M2D2M effectively generates long-term, smooth, and contextually coherent human motion sequences, utilizing a model trained for single-motion generation. Extensive experiments demonstrate that M2D2M surpasses current state-of-the-art benchmarks for motion generation from text descriptions, showcasing its efficacy in interpreting language semantics and generating dynamic, realistic motions.

What problem does this paper attempt to address?

This paper addresses how to generate a coherent, natural sequence of multiple human motions from text descriptions. Given a series of action descriptions, the goal is a motion sequence that transitions smoothly between actions and stays consistent with the text. The challenge has two parts:

1. **Multi-action generation**: Most existing methods focus on generating a single action. Generating a sequence of several consecutive actions additionally requires smooth transitions between them. Directly splicing single actions together often produces unnatural transitions, which hurts the coherence and realism of the overall motion.
2. **Text-to-action alignment**: The generated actions must not only look realistic but also match the semantics of the text description. Each generated segment needs to reflect its corresponding description accurately, and the sequence as a whole needs to remain consistent and coherent.

To address these problems, the paper proposes M2D2M (Multi-Motion Discrete Diffusion Models), which generates long-term, smooth, and contextually coherent multi-action human motion sequences by introducing dynamic transition probabilities and a two-phase sampling strategy. The technical details include:

- **Dynamic transition probabilities**: The transition probabilities are adjusted dynamically according to the distance between action tokens, so that diverse actions are explored early in the diffusion process and the model gradually converges to accurate actions later on.
- **Two-phase sampling strategy**: Joint sampling first produces the basic outline of the entire action sequence; independent sampling then refines each action segment, so that each segment matches its text description while transitions between actions remain smooth.

Together, these techniques allow M2D2M to outperform existing baseline models on the multi-action generation task, demonstrating its effectiveness in generating high-quality multi-action human motion sequences.
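
The two ideas above can be illustrated with a small Python sketch. It is not the paper's implementation: the softmax-over-distance weighting, the `temperature` and `beta` parameters, the function names, and the `denoise` callback are all assumptions made for illustration, and text conditioning is left out entirely. The first function builds a transition matrix in which a token is more likely to jump to nearby codebook entries, with `temperature` controlling how far jumps can reach. The second runs a joint denoising phase over the concatenated sequence followed by an independent phase per segment.

```python
import math


def distance_aware_transition(codebook, beta, temperature):
    """Build a row-stochastic matrix Q (hypothetical form, not the paper's).

    Q[i][j] is the probability that token i becomes token j in one noising
    step. A token is kept with probability 1 - beta; otherwise it jumps to
    another token with probability proportional to exp(-distance / temperature).
    Requires at least two codebook entries.
    """
    k = len(codebook)
    matrix = []
    for i in range(k):
        weights = [
            0.0 if j == i
            else math.exp(-math.dist(codebook[i], codebook[j]) / temperature)
            for j in range(k)
        ]
        total = sum(weights)
        row = [beta * w / total for w in weights]
        row[i] = 1.0 - beta
        matrix.append(row)
    return matrix


def two_phase_sample(noisy_segments, denoise, total_steps, joint_steps):
    """Denoise jointly for `joint_steps` steps, then each segment on its own.

    `denoise(tokens, step)` is a hypothetical callback that returns a list of
    tokens of the same length; a real model would also take the text prompts.
    """
    lengths = [len(seg) for seg in noisy_segments]
    tokens = [tok for seg in noisy_segments for tok in seg]

    # Phase 1: joint denoising over the concatenated sequence.
    for step in range(total_steps, total_steps - joint_steps, -1):
        tokens = denoise(tokens, step)

    # Split the sequence back into one segment per action description.
    segments, start = [], 0
    for n in lengths:
        segments.append(tokens[start:start + n])
        start += n

    # Phase 2: independent denoising of each segment.
    for step in range(total_steps - joint_steps, 0, -1):
        segments = [denoise(seg, step) for seg in segments]
    return segments


# Toy 2-D codebook: token 1 is close to token 0, token 2 is far away.
codebook = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
high_noise = distance_aware_transition(codebook, beta=0.3, temperature=100.0)
low_noise = distance_aware_transition(codebook, beta=0.3, temperature=0.5)
```

With a large `temperature`, jumps from token 0 are spread almost evenly over tokens 1 and 2; with a small one, nearly all of the jump probability goes to the nearby token 1. Tying `temperature` to the diffusion step is one way to get broad exploration at high noise levels and local refinement at low ones, under the assumptions stated above.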

M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

Text-driven Human Motion Generation with Motion Masked Diffusion Model

Fg-T2M: Fine-Grained Text-Driven Human Motion Generation Via Diffusion Model

Human Motion Diffusion Model

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

Motion Generation from Fine-grained Textual Descriptions

MMM: Generative Masked Motion Model

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling

Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

DiffusionPhase: Motion Diffusion in Frequency Domain

Efficient Text-driven Motion Generation via Latent Consistency Training

Realistic Human Motion Generation with Cross-Diffusion Models

ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model

AMD: Autoregressive Motion Diffusion

Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions

Rethinking Diffusion for Text-Driven Human Motion Generation

Guided Motion Diffusion for Controllable Human Motion Synthesis